
How Long Should You Run an A/B Test? The Complete Duration Guide

By ConvertLab Team · 19 January 2026 · 12 min read

One of the most common questions merchants ask when setting up experiments is how long to run an A/B test so the results are reliable and useful. The answer is not a fixed number of days; it depends on traffic, baseline conversion rate, the smallest change you want to detect, and natural fluctuations in your store. This article gives a practical, step-by-step methodology to calculate test duration, plus examples and dos and don'ts for Shopify store owners who want to optimise titles, descriptions and prices.

Why duration matters: reliability, business cycles and decision risk

Run a test for too short a period and you risk false positives caused by random variation. Run it for too long and you waste time and potentially revenue, especially if a winner is already apparent. Duration matters because it balances two types of risk: false positives, and missed opportunities due to slow decision making. In practice you should plan test duration around three things: statistical requirements, business cycles and seasonality, and traffic cadence on the pages you are testing.

Key variables that determine a/b test duration

  • Baseline conversion rate: the current conversion rate on the page or funnel step you are testing. Lower baseline rates require larger samples to detect the same relative uplift.
  • Minimum Detectable Effect (MDE): the smallest relative change you care about detecting; for product titles you might target a 10 to 20 percent uplift, while for price tests you may need smaller MDEs for profitability.
  • Traffic: the number of unique visitors who qualify for the test per day. This is the most direct driver of duration. More traffic shortens tests.
  • Statistical confidence and power: common choices are 95 percent confidence and 80 percent power. Higher confidence or power increases required sample size and therefore duration.
  • Allocation: the number of variants and traffic split. A two-variant 50/50 test needs fewer total visitors than a test with multiple arms.
  • Seasonality and business cycles: weekly patterns, promotions and holidays can bias results if your test does not span representative cycles.

How to calculate required sample size

Sample size depends on baseline conversion rate, MDE and chosen statistical thresholds. The classic frequentist approach for two variants uses z-scores for confidence and power. If you prefer not to do the maths, many online calculators can compute sample sizes given your inputs. Below is a clear way to estimate sample size and turn it into a test duration.

  • Choose baseline conversion rate p. Use recent data from the page where the experiment will run; for example, a product page that converts at 2 percent has p = 0.02.
  • Decide on MDE as an absolute difference or as a relative percent. For a 20 percent relative uplift from 2 percent baseline: p2 = 0.02 × 1.20 = 0.024; absolute delta = 0.004.
  • Choose significance level (alpha) and power (1 − beta). Typical choices: alpha = 0.05 (95 percent confidence), power = 0.80.
  • Use a sample size calculator or the two-proportion formula. For a quick rule of thumb, the sample needed per variant for small conversion rates is often tens of thousands when you want to detect small relative changes.

Example calculation: baseline 2 percent, 20 percent relative uplift, 95 percent confidence and 80 percent power. The required sample per variant will be about 19,700 visitors. That means approximately 39,400 visitors in total for a 50/50 test. If your product page attracts 500 unique visitors per day who qualify for the test, then expected duration is roughly 79 days. If you get 2,000 qualifying visitors a day, duration falls to about 20 days.
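The worked example above can be reproduced with the standard two-proportion sample size formula. This is a sketch using only the Python standard library; note that online calculators use slightly different variance approximations, so results typically land in the 19,000–21,000 range rather than matching any single figure exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion test.

    p1: baseline conversion rate (e.g. 0.02 for 2 percent)
    relative_mde: smallest relative uplift worth detecting (e.g. 0.20)
    """
    p2 = p1 * (1 + relative_mde)          # expected variant conversion rate
    delta = abs(p2 - p1)                  # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / delta ** 2
    return ceil(n)

# Baseline 2%, 20% relative uplift, 95% confidence, 80% power:
n = sample_size_per_variant(0.02, 0.20)
```

With these inputs the formula returns roughly 21,000 visitors per variant; the exact number varies a few percent between calculators, which is why planning figures like "about 19,700" should be treated as estimates rather than hard thresholds.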

Practical example scenarios

Here are concrete scenarios for Shopify merchants to make the numbers relatable. All calculations assume a simple two-variant split and conversion rate measured per visitor to the tested page.

  • High-traffic product with small MDE: baseline 5 percent CR, MDE 10 percent relative (0.5 percent absolute), 95 percent confidence and 80 percent power. Required per variant: ~15,000 visitors. If the page gets 3,000 visitors a day, each variant receives 1,500 a day, so time to reach sample: 10 days. Include additional days to cover weekly patterns; run for 14 days minimum.
  • Typical store product title test: baseline 2 percent CR, MDE 20 percent relative (0.4 percent absolute), same confidence and power. Required per variant: ~19,700 visitors. At 500 visitors a day, you need about 79 days; at 2,000 visitors a day, about 20 days.
  • Price test measuring revenue per visitor: revenue metrics are usually more variable than conversion flags, so you often need larger samples. For a modest-sized store, expect the sample requirement to roughly double. If the CR-based test needed 20,000 per variant, a revenue-per-visitor test might need 40,000 or more per variant.

Translating sample size into days: the simple formula

Use this approach to turn required sample size into duration:

  • Estimate qualifying visitors per day to the page or funnel step you are testing.
  • Divide required visitors per variant by visitors per day allocated to each variant. If allocation is 50/50, visitors per variant per day equals total visitors per day divided by two.
  • Ensure the result spans at least one or two full business cycles; add time if there is known weekly or monthly seasonality.

Example: required per variant 20,000; page gets 1,000 visitors per day; with 50/50 split each variant gets 500 visitors per day. Duration = 20,000 / 500 = 40 days. Add 7 days to cover the weekly cycle; plan for ~47 days total.

When to end an A/B test: rules you can use

Knowing when to end an A/B test is as important as calculating the required length. Here are reliable stopping rules that reduce decision risk.

  • Primary rule: meet your predefined sample size and run through at least one full business cycle. When sample size and time criteria are met, then check whether your primary metric has reached statistical significance.
  • Significance and consistency: a statistically significant result that is consistent across major segments and business days is a strong indicator to stop. Check by device type, traffic source and new versus returning customers.
  • No early peeking: avoid stopping just because a variant looks better after a few days. Repeatedly checking p-values inflates false positive risk. If you use sequential testing or Bayesian methods, design the test with those rules up front.
  • Practical business rule: if a variant shows a materially worse performance after a sufficient sample, you may stop early to limit losses. Define what "sufficient sample" means before the test starts.
  • Inconclusive after full cycle: if you hit sample size and run for the planned duration but have no significant difference, you can declare a tie, increase the sample size and rerun, or change the treatment to a bolder variation and re-test.
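Once the sample size and business-cycle criteria are met, the significance check itself is a standard two-proportion z-test. A minimal sketch (the conversion counts below are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative: 400/20,000 (2.0%) control vs 480/20,000 (2.4%) variant
p = two_proportion_p_value(400, 20000, 480, 20000)
significant = p < 0.05
```

Run this check only after the predefined criteria are met; evaluating it repeatedly during the test is exactly the peeking problem described above.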

Common pitfalls that extend tests unnecessarily

These mistakes either falsely shorten or lengthen tests and are avoidable with simple rules.

  • Stopping early due to a lucky spike: avoid acting on short-term noise.
  • Wrong qualification funnel: if you include non-representative visitors in the sample, conversion rates and variance estimates will be distorted.
  • Ignoring seasonality: tests that overlap promotions or weekends only may be biased. Ensure representative coverage.
  • Checking secondary metrics first: focus on your primary metric for stopping rules; secondary metrics can guide interpretation but should not be the primary stop condition.
  • Too many variants: testing many variants increases total sample needs and duration; prioritise the most promising changes first.

Low-traffic strategies for stores with limited visitors

If your Shopify store has limited traffic, long test durations can be impractical. Here are approaches to still learn efficiently.

  • Increase MDE by testing bolder ideas: large relative changes need smaller samples. Try a radical headline, a different value proposition or a distinct price bracket to increase detectable effect.
  • Use within-subject or paired tests: where appropriate, measure changes for the same customers across sessions, though this approach is not always feasible on e-commerce product pages.
  • Aggregate similar pages: test across a set of similar product pages rather than one product page if the change is applicable across SKUs.
  • Run sequential or Bayesian tests designed for continuous monitoring: these methods allow valid interim checks when implemented correctly; they often require different design and analysis but can be more flexible for low-traffic stores.
  • Prioritise qualitative research: for small stores, customer interviews, session recordings and usability testing can uncover high-impact hypotheses to test when traffic permits.

Adjusting for multiple variants and multiple tests

Testing more than two variants changes the maths. Each additional variant increases the chance of a false positive unless you correct for multiple comparisons. Popular corrections include Bonferroni adjustments and more efficient approaches like controlling the false discovery rate. Practically, if you run a three-variant test and want to preserve the same confidence, expect the sample per variant to increase; a safe approach is to treat pairwise comparisons as separate tests for planning purposes, or to narrow down options with pre-testing.
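For planning purposes, the Bonferroni correction is the simplest option: divide your significance level by the number of pairwise comparisons against the control. It is conservative, which is usually acceptable at the planning stage. A sketch:

```python
def bonferroni_alpha(alpha, n_variants):
    """Adjusted per-comparison significance level when each variant is
    compared against the control. Conservative but simple."""
    comparisons = n_variants - 1
    return alpha / comparisons

# Three-variant test (two comparisons against control): 0.05 / 2 = 0.025
adjusted = bonferroni_alpha(0.05, 3)
```

A stricter alpha means larger z-scores in the sample size formula, so each variant in a three-arm test needs more visitors than in an equivalent two-arm test, on top of the traffic being split three ways.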

Special note: price tests and revenue metrics

Price and revenue-per-visitor tests typically require larger samples because revenue has higher variance than a binary conversion flag. When you test prices:

  • Measure revenue per visitor or margin per visitor rather than conversion rate alone.
  • Estimate variance from historical order values; use that to calculate required sample size for a given detectable difference in revenue per visitor.
  • Consider business impact: even a small percentage uplift in revenue per visitor can be highly valuable; ensure you test for profitable ranges and check downstream effects like returns.
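The sample size calculation for a continuous metric like revenue per visitor uses the metric's variance directly rather than a conversion proportion. A sketch, where the standard deviation and detectable difference below are purely illustrative and should come from your own historical order data:

```python
from math import ceil
from statistics import NormalDist

def sample_size_continuous(sigma, delta, alpha=0.05, power=0.80):
    """Per-variant sample size to detect an absolute difference `delta`
    in a continuous metric (e.g. revenue per visitor) whose standard
    deviation is `sigma`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sigma ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2)

# Illustrative: revenue/visitor std dev of 8.00, detect a 0.50 difference
n = sample_size_continuous(sigma=8.0, delta=0.50)
```

Because revenue variance is typically much larger relative to the effect size than conversion variance, this is where the "expect the sample requirement to roughly double" rule of thumb comes from.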

Practical checklist before you start a test

Follow this checklist so your test duration estimates are accurate and your results are reliable.

  • Record baseline conversion rate and variance for the specific page or funnel step.
  • Decide MDE in business terms; what uplift is worth acting on?
  • Choose confidence and power; common is 95 percent confidence and 80 percent power.
  • Estimate qualifying visitors per day and allocate sample split.
  • Calculate sample size and therefore estimated duration; add buffer for full business cycles.
  • Define stopping rules and metrics up front to avoid peeking bias.
  • Document segments you will check after the test: device, traffic source, returning vs new customers.

How ConvertLab helps with duration and stopping rules

Tools that automate the maths remove a lot of the friction. ConvertLab provides automatic significance calculation and simple configuration of allocation and goals for Shopify stores, so you do not have to manually run sample size formulas every time. It integrates with Shopify product and collection pages, making it straightforward to set up title, description and price tests and monitor progress.

Interpreting results: beyond p-values

When your test finishes, statistical significance is only the start. Consider these steps before you implement changes site-wide.

  • Check business significance: a statistically significant 1 percent uplift may not be worth the change if it costs resources or harms brand perception.
  • Verify across segments: ensure the winner performs across major customer segments. A variant benefiting only mobile users may still be valuable, but you need to know the scope.
  • Look at secondary metrics: bounce rate, add-to-cart rate, average order value and return rate. Interpret trade-offs carefully.
  • Run a follow-up experiment: if effect size is small or you have concerns about novelty, run a follow-up test with the winning variant as the new control to validate stability.
  • Document learnings: keep a results log with test name, hypothesis, effect size and business decision. This builds institutional knowledge for future tests.

When test duration should be longer than the calculation says

Even when you hit sample size and see statistical significance, some practical reasons justify a longer run:

  • Seasonality hitting during your test period that you did not plan for.
  • Traffic source mix shifts during the test window due to promotions or paid campaigns.
  • Large downstream impacts that require longer observation, for example changes in average order value, returns or lifetime value.
  • Need for more data on subsegments that are critical to business strategy.

When shorter tests make sense

Shorter tests can be valid if you design them with the right methods:

  • Use sequential or Bayesian testing frameworks built for continuous monitoring and predefined stopping rules.
  • Accept a higher MDE by testing bold changes; this reduces sample needs and shortens tests.
  • Test in controlled environments such as limited-time campaigns where fast decisions are needed; be explicit about the trade-offs.

Summary: practical duration rules for Shopify merchants

  • If you have high traffic: plan for 1 to 3 weeks for typical product text tests; adjust for MDE and variance.
  • If you have moderate traffic: expect several weeks to a couple of months depending on MDE; use a clear sample size calculation.
  • If you have low traffic: expect months for small MDEs; instead test bolder changes, aggregate pages, or use sequential/Bayesian methods.
  • Always predefine MDE, alpha and power; run at least one full business cycle and avoid peeking without proper sequential methods.

Next steps and practical template

Use this simple template to plan your next A/B test:

  • Metric to optimise: e.g., product-to-cart rate or revenue per visitor.
  • Baseline conversion or value: calculate using recent 14 to 30 days of data.
  • MDE: choose the minimum uplift that would change your decision.
  • Confidence and power: typically 95 percent and 80 percent.
  • Expected qualifying visitors per day: measure on the specific page.
  • Calculate required sample and convert to days; add one business cycle buffer.
  • Document stopping rules; run the test; check segments at the end.

Conclusion

Deciding how long to run an A/B test comes down to planning with data: baseline conversion, desired MDE, traffic and chosen confidence and power. For many Shopify merchants the most practical approach is to calculate the sample size up front, convert that to days using qualifying visitors per day, and then run through at least one full business cycle. When traffic is limited, use bolder tests, aggregate pages or sequential/Bayesian methods to learn faster. After a test concludes, examine both statistical and business significance before applying changes site-wide.

CTA: try ConvertLab

ConvertLab calculates statistical significance automatically — you'll know exactly when your test has a clear winner. Get started and simplify your A/B testing workflow on Shopify: ConvertLab on the Shopify App Store.

📚 Want to dive deeper?

This post is part of our comprehensive A/B testing series.

Read the Complete Guide to A/B Testing Product Descriptions →

ConvertLab Team

The ConvertLab team helps Shopify merchants optimise their product listings through data-driven A/B testing. Our mission is to make conversion rate optimisation accessible to stores of all sizes.

Learn more about ConvertLab

Ready to optimise your product descriptions?

ConvertLab uses AI to generate and A/B test your Shopify product copy. Find out what really converts your customers.

Try ConvertLab Free