This is for everyone experimenting with anything. You’re probably going much slower than you need to.
In This Whitepaper
In this whitepaper we will rebuild the statistical foundations of your digital experimentation program from the axioms up.
In part 1 we walk through the momentous statistical misunderstanding holding you back & offer a practical solution to this specific problem. In later instalments we’ll go on to:
- Examine the particular importance of these concepts for mature personalisation programs.
- Take a look at how we adjust revenue projections to accommodate low confidence.
- Reference William Gadea’s tool (ref) that uses Relative Error Cost (introduced below) to tell you exactly when to end your test.
- Showcase DMPG’s hyperspeed roadmapping tool, automating each of these statistical concepts into an easy-to-use set of dropdowns.
- Then we’ll wrap up with some common statistical blunders.
In Whitepaper Part 1: Perfect Statistical Significance
Statistical significance is the most important number within any experimentation program. Also known as Confidence, it is expressed as a percentage, and a threshold (usually 95%) is typically chosen as the minimum required before ending an experiment.
In part 1 of this whitepaper we introduce the concept of the ‘optimal statistical significance threshold’ (OSS), before defining it in terms of the ratio of ‘error cost’ to ‘opportunity cost’ (ref), which we will refer to as Relative Error Cost (REC).
We will build on the work of a number of bodies in challenging the substantial & avoidable risk posed by universal adoption of very high Statistical Significance (SS) thresholds, particularly within extremely low REC environments (e.g. Digital Experimentation). We will then go on to suggest an alternative dynamic that takes us much closer to OSS.
Note: some tools refer to ‘Confidence’ or ‘Probability of being a winner’, while others have options to toggle to a ‘Bayesian’ method. Usually these are just different ways of stating Statistical Significance (SS) and as such here we will only refer to SS.
Context
We get this from medical research
Within medical research there is a longstanding convention of requiring 95% ‘Statistical Significance’.
It is commonly accepted that within this life-or-death context, being 94% sure that a new medicine will improve lives is simply not enough, and a 1-in-20 (5%) risk of false positives is considered optimal.
Why?
The life-or-death context is actually irrelevant, statistically.
As the stakes increase, the cost of turning down an effective medication (a false negative) increases in proportion to the cost of launching an ineffective medication (a false positive), with a net-neutral effect on the overall risk profile.
If this were only about high stakes, how could we justify turning down a medication that we are 94% confident will improve lives?
It’s actually about production & maintenance costs.
These costs are unique to the variant (the new medication) and as such raise “error cost” (the cost of false positives) far in excess of “opportunity cost” (the cost of false negatives).
- Opportunity cost (the cost of missing out on an improvement) is simply “customers not getting the best treatment”
- Error cost (the cost of launching an ineffective treatment) includes¹:
  - Customers not getting the best treatment
  - $Ms investment in production
  - The human cost of actively taking & paying for a new treatment which is actually ineffective
  - The risk of this new treatment actually being worse than the control (damaging).
- Relative Error Cost (REC) (error cost relative to opportunity cost) is high, as high production & maintenance costs feature uniquely within Error Cost.
If (hypothetically) the new treatment could be teleported into people’s bodies at zero cost or effort then REC would be roughly 1:1 & the treatment would only need to be “more likely than not” to improve lives to be worth launching (i.e. SS would only need to be >50%).
It is the magnitude of production costs that causes error cost to far exceed opportunity cost (= high REC) and necessitates SS thresholds far in excess of 50%.
We can thus summarise as below:
“Optimal SS threshold starts at 51% and rises toward 100% in proportion to Relative Error Cost, and is not affected by the general magnitude of impact”
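To make this relationship concrete: if SS is read as the probability that the variant is genuinely better, and error cost and opportunity cost can be placed on a common scale, launching becomes worthwhile once (1 − SS) × error cost falls below SS × opportunity cost, i.e. once SS exceeds REC / (1 + REC). The sketch below is an illustration under those simplifying assumptions, not a prescription:

```python
def optimal_ss_threshold(relative_error_cost: float) -> float:
    """Break-even SS threshold for a given Relative Error Cost.

    REC = error cost / opportunity cost. Launching is worthwhile once
    (1 - p) * error_cost < p * opportunity_cost, i.e. p > REC / (1 + REC),
    treating p (SS) as the probability that the variant is genuinely better.
    """
    return relative_error_cost / (1 + relative_error_cost)


print(optimal_ss_threshold(1))   # 0.50 -> the zero-cost "teleported medicine" case
print(optimal_ss_threshold(19))  # 0.95 -> error cost ~19x opportunity cost
```

A 1:1 REC recovers the “more likely than not” bar of just over 50%, while an error cost roughly 19 times the opportunity cost recovers the familiar 95%.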
Application to Digital Experimentation
It’s all about production costs
Building on our conclusion above, we can confidently make the following statements:
- Ubiquitous adoption of 95% SS thresholds is far from optimal, and within low production cost environments can introduce extreme and unnecessary opportunity cost.
- The temptation to base SS-threshold decisions on prominence (i.e. on how visible or high-impact a test is) rather than on Relative Error Cost must be resisted.
- An optimal testing program would set SS thresholds dynamically based on a case by case assessment of Relative Error Cost.
- Should such case-by-case assessment prove unworkable (as is typical), a binary & objective high-vs-low REC categorisation, with correspondingly high or low SS thresholds, should still substantially reduce overall risk compared with ubiquitous use of 95% SS.
Examples
If pushing a test live and maintaining it incur zero net new cost or effort, then a 51% SS threshold would be statistically optimal.
In practice any new feature is likely to incur both production and maintenance costs.
Neutral-REC scenarios are, however, surprisingly common, especially within mature digital personalisation programs, as we will find below:
Example 1: high error cost = high SS
We test a new search experience on our website which, upon winning, must be hard-coded by the development team and will be more complex to maintain than the existing experience.
- Production Cost = high
- Maintenance cost = high
- Therefore Error Cost far exceeds Opportunity cost (= high REC) & SS threshold should be high
If, based on this dynamic, we are to require 95% SS for this test, we are in effect saying:
“even if we are 94% confident that the new search page is better, it’s in our best interest to stick with the old page, because the high cost of implementing and maintaining this change places a greater burden of proof on the variant than the control”
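Viewed through the break-even sketch above, a 95% threshold implicitly treats error cost as roughly 19 times opportunity cost (0.95 / 0.05 = 19), a ratio that may well be plausible once hard-coding and ongoing maintenance are priced in.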
Example 2: near-neutral REC = near-50% statistically optimal SS
A product recommendation module on the Basket Page is served by Adobe Target, our testing & personalisation tool.
We test a new algorithm for this recommendation, which upon winning will remain served by Adobe Target and will incur a similar maintenance burden to the original module.
If the Variant wins then we duplicate the test and increase targeting to 100% of traffic; if the Control wins then we simply deactivate the test.
- Production Cost = near-zero
- Maintenance Cost = near-zero
If we are to require 95% SS for this test, we are in effect saying:
“even if we are 94% confident that the new algorithm is better, it’s in our best interest to stick with the existing algorithm, because [_______________]”
Filling in this blank is plainly challenging, and the 95% threshold cannot be justified.
A statistically optimal threshold here will be very close to 50%, after an appropriate minimum run-time.
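Again using the break-even sketch: with production and maintenance costs near zero, REC is close to 1:1, so the threshold falls to roughly 1 / (1 + 1) = 50%, and anything meaningfully above that “more likely than not” bar justifies launching the variant.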
Our Solution
An intelligent assessment of risk
DMPG’s solution is baked into our roadmapping practice, which we’ll cover later in this whitepaper series.
By default we make a binary assessment of Error Cost, independent of Opportunity Cost, via a simple set of questions (see demo).
This tool asks 2 standard questions, and typically involves a few more checks specific to your business:
- Upon winning, this test will be launched with <1 human-day of development effort
- Upon winning, this test will incur a near-zero net maintenance burden
In the event that all of these checks are passed, the test is deemed “low REC” and the SS threshold is set at 70%, after a predetermined maximum runtime. We then add granularity over time, building toward a broader range of SS outputs.
The objectivity here is crucial, as is the simplicity, and so far it’s worked well for us.
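As an illustration only, a minimal sketch of this categorisation logic might look like the following, where the two standard questions become boolean checks and the 95% high-REC default is an assumption rather than a fixed rule:

```python
def ss_threshold(dev_effort_days: float, near_zero_maintenance: bool,
                 extra_checks_passed: bool = True) -> float:
    """Binary REC categorisation -> SS threshold.

    Low REC (cheap to launch and cheap to maintain) gets a 70% threshold;
    everything else keeps a high threshold (95% assumed here).
    """
    low_rec = (dev_effort_days < 1) and near_zero_maintenance and extra_checks_passed
    return 0.70 if low_rec else 0.95


# Example 2 (algorithm swap served by Adobe Target): low REC -> 0.70
print(ss_threshold(dev_effort_days=0, near_zero_maintenance=True))

# Example 1 (hard-coded search experience): high REC -> 0.95
print(ss_threshold(dev_effort_days=10, near_zero_maintenance=False))
```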
Next Up
How does all this sound to you?
With the maths out of the way, next we’ll home in on personalisation. While aggregate testing can survive with very high SS thresholds, personalisation cannot; we’ll frame both the problem and the solution in part 2 of this whitepaper.
____________________________________________________________________
Appendix
¹ There is also typically a large body of pre-experiment data reinforcing the effectiveness of the Control treatment. Because of this, while we may find that the Variant has a 90% chance of being the better treatment, within that 10% lies a risk of side effects or an entirely null effect that far exceeds the equivalent risk within the Control.
This factor is not expanded on within this whitepaper because, within our core subject here (Digital UX Experimentation & Personalisation), Control treatments are not typically tested in any meaningful way pre-experiment and as such do not benefit from this phenomenon.