What makes an end-to-end test flaky?

Almost always one of five root causes: a timing assumption (waiting a fixed number of milliseconds instead of waiting for a real condition), shared state between tests (one test's database writes bleed into another), a network or animation race (the UI is still mid-transition when the assertion fires), test-order dependence (a test assumes another ran first), or retries masking a real problem rather than fixing it.

Should I use retries to fix flaky tests?

Retries are a quarantine tool, not a fix. Using retries without also tracking and fixing the underlying cause lets flake accumulate until your suite is meaningless. The right sequence: quarantine the flaky test with a retry so CI stays green, file a ticket, then fix the root cause and remove the retry.

How do I make test data hermetic without reseeding the whole database every run?

Create all the data a test needs at the start of that test via API or factory, and tear it down (or mark it deleted) at the end. Never share records across tests unless they are immutable seed data. If your test framework supports it, wrap each test in a transaction that rolls back: this is far faster than reseeding and keeps database writes from leaking between tests, provided the application under test writes through that same transaction.

What's the difference between a slow test and a flaky test?

A slow test always takes a long time. A flaky test sometimes passes and sometimes fails — the failure isn't deterministic. Slowness is a performance problem; flakiness is a correctness problem. Flaky tests are worse because they corrode trust: developers start ignoring failures, which is how real regressions slip into production undetected.

How to fix flaky end-to-end tests (and stop re-running CI)

A flaky test is a broken trust contract. It says: this suite might pass or fail for reasons unrelated to your code. Once your team learns to re-run CI until it goes green, the entire test suite stops meaning anything — and real regressions start shipping to production undetected.

The good news: flaky e2e tests are almost never random. They are deterministic failures triggered by conditions your test doesn't control yet. Understand the five root causes and you can fix them for good instead of masking them with retries.

Root cause 1: timing assumptions

The most common source of flake is sleep(500) or any fixed-duration wait. The page might render in 120ms on a developer laptop and 800ms on a loaded CI runner. Either the sleep is too short and the assertion fires before the element exists, or it's so long it makes the suite unbearably slow.

The fix: deterministic waits

Modern browser testing frameworks — Playwright, Cypress — have built-in auto-wait: they poll until the element is present, visible, and stable before acting. Use these instead of any sleep:

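Wait for a specific element to be visible: await page.waitForSelector('[data-testid="confirm-modal"]')
Wait for the response to a specific request: await page.waitForResponse(r => r.url().includes('/api/checkout')). Start this wait before the action that triggers the request and await it afterwards, so a fast response cannot arrive before you are listening.
Wait for a navigation: await page.waitForURL('**/dashboard') (a bare '/dashboard' only matches when a baseURL is configured)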

The rule: never assert against a state you haven't explicitly waited for. If you find yourself writing await sleep(n), replace it with a wait for the condition that n milliseconds was meant to approximate.
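To make the rule concrete, here is a minimal sketch of what "wait for the condition" means when no framework helper is available. The names waitFor and simulateRender, and the option names, are hypothetical and are not part of Playwright or Cypress; in a real suite the framework's built-in waits are the better choice.

```typescript
// Poll a condition until it holds or a deadline passes.
// `waitFor` and its options are illustrative names, not a framework API.
async function waitFor(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 }: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return; // condition met: stop waiting immediately
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical stand-in for a UI element that renders after an unknown delay.
function simulateRender(delayMs: number): { isVisible: () => boolean } {
  let visible = false;
  setTimeout(() => {
    visible = true;
  }, delayMs);
  return { isVisible: () => visible };
}
```

Inside a test, await waitFor(() => modal.isVisible()) returns as soon as the element appears, whether that takes 120ms or 800ms, and fails with a clear error if it never does.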

Root cause 2: shared state and non-hermetic fixtures

Test A creates a user, completes a flow, and leaves records in the database. Test B assumes it starts from a clean state, but it doesn't — it inherits whatever Test A left behind. The tests pass in isolation and fail in suite. Worse, they fail intermittently depending on which test runs first or how fast parallel workers produce writes.

The fix: hermetic, seeded data per test

Each test must own its data. The practical patterns, from fastest to most isolated:

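Transaction rollback. Wrap each test in a database transaction that rolls back at teardown. This is close to free: no reseed, no teardown query, and isolation as soon as the rollback runs. It works when the test and the application write through the same database connection, which is typical of backend-integrated suites.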
Factory creation + explicit teardown. Create the specific records you need via API or a test factory at the start of each test; delete them (or soft-delete) at the end. This is the right pattern for tests that can't share a database transaction with the app.
Seed + reset. A full reseed before each test is correct but slow. Reserve it for small data sets or suites where the speed cost is acceptable.

The invariant: a test must not care which other tests ran before it, in what order, or whether they passed. If it does, the test has an implicit dependency on external state — find it and remove it.
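The factory-plus-teardown pattern can be sketched as follows. This is an illustration under stated assumptions: the in-memory db map stands in for a real database or API, and createUser and withUser are hypothetical helper names.

```typescript
// In-memory stand-in for a real database or API; purely illustrative.
type User = { id: string; email: string };
const db = new Map<string, User>();

let counter = 0;

// Factory: every call yields a record that no other test can collide with.
function createUser(overrides: Partial<User> = {}): User {
  counter += 1;
  const id = `user-${Date.now()}-${counter}`;
  const user: User = { id, email: `${id}@example.test`, ...overrides };
  db.set(user.id, user);
  return user;
}

// Run a test body with its own user and always clean up afterwards.
async function withUser<T>(body: (user: User) => Promise<T> | T): Promise<T> {
  const user = createUser();
  try {
    return await body(user);
  } finally {
    db.delete(user.id); // teardown runs even if the body throws
  }
}
```

Because the teardown sits in a finally block, a failing test still removes its records, so the next test does not inherit them.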

Root cause 3: network and animation races

Your test clicks a button. The button triggers a fetch. The assertion checks for the success banner. But the assertion fires while the fetch is still in flight — the banner isn't there yet — and the test fails. This is a network race. The same pattern appears with CSS animations and transitions: you assert the position of an element that is still mid-animation.

The fix: control the environment

For network races, intercept and control the requests in your test:

Mock slow or flaky external APIs. In Playwright: await page.route('**/api/external', route => route.fulfill({...})). This removes network latency entirely for that endpoint.
Wait for the response, not a fixed time. Use page.waitForResponse() tied to the specific request your action triggers.
Disable animations globally in test mode. Add a CSS class or env flag your tests set: *, *::before, *::after { animation-duration: 0ms !important; transition-duration: 0ms !important; }. Playwright's page.emulateMedia({ reducedMotion: 'reduce' }) emulates the prefers-reduced-motion: reduce media feature, which gives the same effect only if your stylesheets already honor that preference.

For time-dependent logic (token expiry, scheduled tasks, debounced inputs), inject a clock your tests can control. Jest has jest.useFakeTimers(); for browser tests, Playwright's page.clock.install() replaces Date and the timer functions inside the page with controllable fakes.
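Where neither tool fits, the same idea can be applied by hand. The sketch below assumes the code under test accepts a clock as a parameter; Clock, FakeClock, Token, and isExpired are hypothetical names chosen for illustration.

```typescript
// A clock the code under test asks for the time, instead of calling Date.now().
interface Clock {
  now(): number; // milliseconds since the epoch
}

const systemClock: Clock = { now: () => Date.now() };

// Test double: time only moves when the test says so.
class FakeClock implements Clock {
  private current: number;
  constructor(start = 0) {
    this.current = start;
  }
  now(): number {
    return this.current;
  }
  advance(ms: number): void {
    this.current += ms;
  }
}

type Token = { issuedAt: number; ttlMs: number };

// Production code defaults to the real clock; tests pass a fake one.
function isExpired(token: Token, clock: Clock = systemClock): boolean {
  return clock.now() >= token.issuedAt + token.ttlMs;
}

// In a test: jump an hour forward without waiting an hour.
const clock = new FakeClock(1_000);
const token: Token = { issuedAt: clock.now(), ttlMs: 60 * 60 * 1000 };
const freshBefore = isExpired(token, clock); // false
clock.advance(60 * 60 * 1000);
const expiredAfter = isExpired(token, clock); // true
```

The test controls time completely, so the result no longer depends on how long the runner took to reach the assertion.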

Root cause 4: test-order dependence

Test C relies on the user that Test A created. Or it relies on a cookie set by Test B. Run them in a different order, or in parallel, and Test C fails with a 404 or a redirect to the login page. The tests look independent, but they're secretly a workflow that must run in sequence.

The fix: make every test self-contained

Each test must set up its own preconditions. If a test needs an authenticated user, it must log in itself — not rely on a session left by a previous test. If it needs a specific record, it creates it. The setup cost feels annoying until you realize it's exactly what gives you confidence to parallelize: tests that own their preconditions can run in any order, on any worker, simultaneously.

If setup genuinely is expensive (say, a 10-second OAuth flow), cache it at the suite level — create the session once and share it across tests as read-only fixture state. But any writes within the test must be isolated.
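One way to sketch suite-level caching, assuming a hypothetical expensiveLogin step in place of the real OAuth flow (Session and getSharedSession are also illustrative names):

```typescript
// Hypothetical session shape; a real suite would run its OAuth flow in expensiveLogin.
type Session = Readonly<{ userId: string; token: string }>;

let loginCalls = 0;

async function expensiveLogin(): Promise<Session> {
  loginCalls += 1;
  // Freeze the result so no test can mutate the shared fixture at runtime.
  return Object.freeze({ userId: "fixture-user", token: "fixture-token" });
}

// Cache the promise rather than the value, so concurrent callers share one login.
let cached: Promise<Session> | undefined;

function getSharedSession(): Promise<Session> {
  if (!cached) cached = expensiveLogin();
  return cached;
}
```

Every test calls getSharedSession() itself, so no test depends on another having run first; the login still happens only once per process.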

Root cause 5: retries used as a crutch

Flaky tests get retried. With three attempts per test, a five-minute CI run can stretch to fifteen minutes and still give you zero signal about what's wrong. Worse: if the test fails on the first two attempts and passes on the third, it's marked green. A real regression that fails 70% of the time still passes at least one of three attempts about 66% of the time (1 - 0.7^3 ≈ 0.66), so it will usually slip through.
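The arithmetic behind a retry masking a regression can be sketched as follows, assuming each attempt fails independently with the same probability. The function name greenProbability is illustrative.

```typescript
// Probability that a test is reported green when each attempt fails
// independently with probability `failRate` and CI allows `attempts` tries.
function greenProbability(failRate: number, attempts: number): number {
  return 1 - Math.pow(failRate, attempts);
}

const single = greenProbability(0.7, 1); // about 0.30
const three = greenProbability(0.7, 3); // 1 - 0.343, about 0.657
```

Under that assumption, a bug that fails 70% of attempts goes from being caught 70% of the time to being caught only about 34% of the time once three attempts are allowed.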

The fix: quarantine, track, and fix

Retries are a quarantine tool, not a cure. The correct sequence:

Add a retry to the broken test so CI doesn't block the team right now.
File a ticket immediately. Name the test, describe the observed failure mode, link to the CI run. This is not optional — it's the contract that says "this retry is temporary."
Fix the root cause using the patterns above, then remove the retry.

Track your flaky test count as a metric. If it's growing, your test suite is rotting. A team that allows indefinite retries without fixing root causes will eventually have a suite where nothing is trusted — and the cost of that is invisible regressions, not slower CI.
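A flaky-test count can be derived from CI results with very little code. The sketch below assumes a hypothetical record shape (TestResult) with the number of attempts and the final outcome; real CI systems expose this data in different formats.

```typescript
// One record per test per CI run; the shape is an assumption for illustration.
type TestResult = { name: string; attempts: number; passed: boolean };

// A test counts as flaky in a run if it passed only after at least one retry.
function flakyTests(results: TestResult[]): string[] {
  const names = new Set<string>();
  for (const r of results) {
    if (r.passed && r.attempts > 1) names.add(r.name);
  }
  return Array.from(names).sort();
}

const run: TestResult[] = [
  { name: "checkout completes", attempts: 3, passed: true },
  { name: "login redirects", attempts: 1, passed: true },
  { name: "profile saves", attempts: 3, passed: false },
];

const flaky = flakyTests(run); // ["checkout completes"]
```

Recording flaky.length per run gives the trend line: a number that keeps rising means retries are hiding more problems than the team is fixing.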

Connecting test failures to real bugs

Even a deterministic, non-flaky e2e test can tell you that something broke without telling you what broke or why. That's why it's worth pairing your test suite with a way to capture what was on screen when it failed: the actual UI state, the network errors, the console output at the moment of failure. That's the same evidence that makes a manual bug report reproducible — and it's equally useful when the reporter is an automated test rather than a human. Tools like Klavity AutoSim run AI personas through your product continuously and file grounded bug reports when they hit failures — the same kind of evidence-capture that makes the difference between a ticket that gets fixed and one that gets closed as "cannot reproduce."