
FB-04 · TOOL · Second Way: Feedback

A/B Testing

Let data decide. How to form a testable hypothesis, run a controlled experiment in production, and avoid the most common statistical traps.

Sources: The Lean Startup, The DevOps Handbook

Video Lesson

A video lesson for this topic is in development. The library articles and mission exercises cover the same material in the meantime.

01

What is A/B testing?

An A/B test is a controlled experiment where two versions of a feature are shown to different groups of users simultaneously, and a metric is measured to determine which version performs better. Version A is the control (existing behavior). Version B is the treatment (new behavior).

A/B testing is the production implementation of hypothesis-driven development: you form a hypothesis about user behavior, build the smallest test that could prove or disprove it, and let the data decide — not intuition, seniority, or design opinion.

Eric Ries, in The Lean Startup, calls this validated learning. The goal of every product decision is to generate validated knowledge about what users actually want — not what we think they want.

02

How to run an A/B test

1

Form a falsifiable hypothesis

"We believe that changing the checkout button from 'Complete Order' to 'Buy Now' will increase checkout completion rate by 5% for users who reach the payment page." Specific. Measurable. Falsifiable.

2

Choose a single primary metric

One metric per test. If you measure 20 metrics and declare victory on whichever one improves, you will find a false positive every time. The metric must be chosen before the test runs.

3

Calculate the required sample size

Use a power analysis to determine how many users you need to detect the effect size you care about at your desired confidence level. Small effects need large samples. Do this before you start.

4

Run for the required duration

Do not stop early when you see a promising result. Run until you have the predetermined sample size. Stopping early dramatically increases false positive rates.

5

Analyze and decide

If the result is statistically significant at your threshold (typically p < 0.05) and the effect size is practically meaningful, ship the winner. Otherwise, the null hypothesis stands.

03

Statistical significance

Statistical significance answers the question: how likely is a result at least this large if there were no real effect? A p-value of 0.05 means that, if the treatment had no effect, there would be a 5% chance of observing a difference at least this extreme. It is a threshold for decision-making, not a measure of importance.
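For a conversion-rate experiment, significance is typically computed with a two-proportion z-test. The sketch below is a hypothetical standalone implementation for illustration (real experiment platforms do this for you); the normal CDF uses the Abramowitz-Stegun polynomial approximation.

```javascript
// Illustrative sketch: two-sided p-value for a two-proportion z-test.
// convA/nA: conversions and sample size for control; convB/nB for treatment.
function twoProportionPValue(convA, nA, convB, nB) {
  const pA = convA / nA;
  const pB = convB / nB;
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  const z = Math.abs(pA - pB) / se;
  return 2 * (1 - normalCdf(z));
}

// Standard normal CDF via the Abramowitz-Stegun polynomial
// approximation (accurate to about 1e-7).
function normalCdf(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const d = 0.3989423 * Math.exp((-x * x) / 2);
  const upperTail = d * t * (0.3193815 + t * (-0.3565638 +
    t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return x > 0 ? 1 - upperTail : upperTail;
}
```

For example, 1,000 conversions out of 10,000 in control against 1,100 out of 10,000 in treatment gives a p-value of about 0.02: significant at the 0.05 threshold.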

Trap: Peeking

Checking results before the test is complete and stopping as soon as you see what you want. Repeated peeking can inflate the false positive rate from a nominal 5% to over 30% at p < 0.05.

Fix

Pre-register your sample size. Do not look at results until you have it.

Trap: Multiple metrics

If you test 20 metrics at p < 0.05, one will appear significant by chance. This is the multiple comparisons problem.

Fix

One primary metric. Secondary metrics are exploratory, not conclusive.

Trap: Novelty effect

Users engage with anything new. Short tests capture novelty, not sustained behavior change. A two-day test may show a lift that has vanished by day seven, once the novelty wears off.

Fix

Run tests long enough to see post-novelty behavior: at minimum one full week, so the test covers both weekday and weekend usage.

Trap: Segment isolation

Users who see both variants — due to cookie clearing, device switching, or VPN — corrupt the experiment.

Fix

Assign variants by stable user ID, not by cookie, and exclude users who were exposed to both variants from the analysis.
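The first half of this fix can be sketched by hashing a stable user ID into a bucket. This is an illustrative standalone version of the assignment logic, not any specific platform's API; the FNV-1a hash and the function name are assumptions.

```javascript
// Illustrative sketch: deterministic variant assignment from a stable
// user ID. Hashing (experiment + userId) to a bucket in [0, 100) means
// the same user always lands in the same variant, on every device.
function getVariant(experimentName, userId, treatmentPercent = 50) {
  const key = `${experimentName}:${userId}`;
  // FNV-1a, a simple non-cryptographic string hash.
  let hash = 2166136261;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 16777619);
  }
  const bucket = (hash >>> 0) % 100;
  return bucket < treatmentPercent ? 'treatment' : 'control';
}
```

Including the experiment name in the hash key matters: it decorrelates bucket assignments across experiments, so the same users are not always the guinea pigs for every treatment.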

04

A/B testing in production

A/B testing at the infrastructure level is implemented using feature flags. Traffic is split by user segment — typically a percentage of users assigned to each variant at login or session creation. The flag system routes each user to their assigned variant consistently.

// Variant assignment: consistent per user
const variant = experiment.getVariant('checkout_button', userId);
// variant === 'control' | 'treatment'

// Record the exposure for every user, not just the treatment group,
// or the conversion comparison is biased.
analytics.track('checkout_button_seen', { variant });

if (variant === 'treatment') {
  return <button>Buy Now</button>;
}
return <button>Complete Order</button>;

The experiment platform records every exposure and conversion event, calculates statistical significance continuously, and surfaces results in a dashboard. Engineers ship the winning variant by updating the flag default and eventually removing the flag.

05

Beyond A/B

Multivariate testing

Test multiple variables simultaneously — button color AND button text AND page layout. More efficient than sequential A/B tests, but requires much larger sample sizes. Use sparingly.

Bandit algorithms

Adaptive experiments that dynamically shift traffic toward the winning variant as data accumulates. Minimize regret (lost conversions during the test). Best for short-horizon decisions with clear metrics.
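A minimal sketch of the idea, using the simplest bandit strategy (epsilon-greedy) for illustration; production systems more often use Thompson sampling, and all names here are assumptions.

```javascript
// Illustrative sketch: epsilon-greedy bandit. Explore a random variant
// 10% of the time; otherwise exploit the variant with the best
// observed conversion rate, shifting traffic toward the winner.
class EpsilonGreedy {
  constructor(variants, epsilon = 0.1) {
    this.epsilon = epsilon;
    this.stats = new Map(variants.map((v) => [v, { shown: 0, converted: 0 }]));
  }

  choose() {
    const variants = [...this.stats.keys()];
    if (Math.random() < this.epsilon) {
      // Explore: pick uniformly at random.
      return variants[Math.floor(Math.random() * variants.length)];
    }
    // Exploit: pick the variant with the highest observed rate.
    const rate = (v) => {
      const s = this.stats.get(v);
      return s.shown ? s.converted / s.shown : 0;
    };
    return variants.reduce((best, v) => (rate(v) > rate(best) ? v : best));
  }

  record(variant, converted) {
    const s = this.stats.get(variant);
    s.shown += 1;
    if (converted) s.converted += 1;
  }
}
```

The trade-off against a fixed A/B test: the bandit loses fewer conversions during the experiment, but the unequal, shifting traffic split makes rigorous significance testing harder.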

Holdout groups

A permanently held-out group of users who never see new features. Enables long-term measurement of the cumulative effect of all product changes over months. Expensive but illuminating.

Interleaving

Used in ranking systems: show results from both algorithms in a single interleaved list, detect preference from which results users click. Dramatically more sensitive than standard A/B for ranking problems.
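A simplified sketch of the mechanism: this version strictly alternates picks between the two rankings, whereas production team-draft interleaving also randomizes which ranker picks first in each round.

```javascript
// Illustrative sketch of team-draft interleaving (simplified to strict
// alternation). Each "team" (ranking) takes turns contributing its
// highest-ranked item not already in the merged list; clicks are later
// credited to whichever team contributed the clicked item.
function teamDraftInterleave(rankingA, rankingB) {
  const result = [];
  const taken = new Set();
  let i = 0;
  let j = 0;
  let aTurn = true;
  while (i < rankingA.length || j < rankingB.length) {
    const source = aTurn ? rankingA : rankingB;
    let idx = aTurn ? i : j;
    // Skip items the other team already contributed.
    while (idx < source.length && taken.has(source[idx])) idx++;
    if (idx < source.length) {
      result.push({ item: source[idx], team: aTurn ? 'A' : 'B' });
      taken.add(source[idx]);
    }
    if (aTurn) i = idx + 1;
    else j = idx + 1;
    aTurn = !aTurn;
  }
  return result;
}
```

Because every user sees one merged list containing both algorithms' results, each session yields a direct preference signal, which is why interleaving needs far less traffic than a standard A/B split of the ranking.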

06

Further reading

Trustworthy Online Controlled Experiments — Kohavi et al.

The definitive book on A/B testing at scale. Every trap described in this article is covered with case studies from Microsoft, LinkedIn, and Airbnb.

The Lean Startup — Eric Ries

The origin of hypothesis-driven development. Chapter 7: Measure. The build-measure-learn feedback loop as organizational practice.

DevOps Handbook — Chapter 22

Create Telemetry to Enable Seeing and Solving Problems. A/B testing as part of the production feedback infrastructure.

Evan Miller — A/B Testing Statistics

evanmiller.org. Clear explanations of statistical concepts for engineers. The sequential testing and sample size calculators are used industry-wide.