Introduction
A mature personalisation program will involve a large number of experiences, with impact measured at both the Experience & Aggregate levels.
To demonstrate ROI, the program must approximate aggregate commercial (usually Revenue) impact, and we have two options for this:
- Global holdout group: a small Control that sees no personalisation at all
- Sum of Experience-level impacts: a simple sum formula
Historically I’ve usually recommended #1, due to its relative statistical simplicity and the fact that it accounts for changing customer sentiment over time.
More recently however, we’ve seen this lead on some occasions to debilitating levels of confusion, as month by month the Experience & Aggregate-level results can diverge enormously.
Example: 20 of your 25 Experiences may show a positive revenue impact in January, with a summed impact of £5M. With method #1, and assuming normal volatility, the Global Impact number will often be negative. Statistically this is not a problem, but when stakeholders see two numbers contradicting each other like this, a hard-to-budge sense of mistrust can easily take hold, which then inhibits the spread of Personalisation.
In this document we will focus on method #2, seeking to clarify the impact of relevant statistical concepts on aggregation.
Relevant Statistical Concepts
There are three key statistical concepts that can lead to distorted results when running & summing large numbers of overlapping tests:
1) The Multiple Comparisons Problem – inflates
- Problem: this genuinely inflates summed Impact numbers
- Why? If we sum results from a large number of tests, the false positive rate increases and the summed uplift is inflated (see below)
2) Overlapping Samples – exaggerate
- Problem: increases the volatility of results, but on average should have a net neutral effect on uplift – not an inflationary force, at scale (see the simulation sketch after this list).
- Why the concern? If the same users are part of multiple tests and their actions (like a £100 order) are counted separately in each test, it looks as though the total summed impact number will be artificially inflated.
- Why it is not, in fact, an inflationary force: within the 5% Controls there will (at scale) be roughly 1/19th as many instances of these overlapping orders as in the 95% Variants – making for a net neutral effect on summed impact.
- But it can lead to moments of inflation: if, by random chance, these audience overlaps are not spread evenly between Control and Variant, then sometimes certain people will have a big inflationary effect on summed impact, and sometimes a big deflationary effect.
3) Interaction Effects – distort
- Problem: each test can skew every other test in an unknown way. This is not an inflationary force, just a distorter.
- Why? Changes in one experience could affect user behaviour in another. If tests are run simultaneously without accounting for these interactions, the overall impact could be misrepresented.
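To illustrate point 2, here is a minimal Monte Carlo sketch in Python (all parameter values are illustrative assumptions, not figures from a real program). The same pool of users sits in every one of 25 tests, each test gets its own independent 5% / 95% Control/Variant split, and every £100 order is counted separately in each test. With no real personalisation effect, the summed impact averages out near zero but swings from period to period:

```python
import numpy as np

rng = np.random.default_rng(42)

N_USERS = 20_000       # users eligible for every test (assumption)
N_TESTS = 25           # overlapping experiences
CONTROL_SHARE = 0.05   # 5% Control / 95% Variant, as in the text above
ORDER_PROB = 0.03      # chance a user places an order this period (assumption)
ORDER_VALUE = 100.0    # the £100 order from the example above
N_PERIODS = 500        # simulated measurement periods

summed_impacts = []
for _ in range(N_PERIODS):
    # Each user's spend this period, with NO real personalisation effect.
    spend = rng.binomial(1, ORDER_PROB, N_USERS) * ORDER_VALUE

    total = 0.0
    for _ in range(N_TESTS):
        # Independent Control/Variant split per test, so the same order is
        # counted separately in each of the 25 measurements.
        in_control = rng.random(N_USERS) < CONTROL_SHARE
        variant_mean = spend[~in_control].mean()
        control_mean = spend[in_control].mean()
        # "Incremental revenue" credited to this test's Variant audience.
        total += (variant_mean - control_mean) * (~in_control).sum()
    summed_impacts.append(total)

summed_impacts = np.array(summed_impacts)
print(f"mean summed impact: £{summed_impacts.mean():,.0f}")   # ~ £0 at scale
print(f"std of summed impact: £{summed_impacts.std():,.0f}")  # large: the volatility
```

With these illustrative numbers the mean lands close to £0 while the spread is wide – exactly the net-neutral-but-volatile behaviour described above.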
Inflation is usually the most problematic of these, so let’s dive into the MCP…
The Multiple Comparisons Problem
The “multiple comparisons problem” (MCP) (ref) dictates that stricter significance thresholds are required when summing results from multiple experiments, due to the increased chance of false positives.
The MCP becomes relevant only when impact from multiple tests is summed.
- Experience-level is OK: when we roll up markets to get to experience-level results, this is not a problem at the aggregate level. We just need to bear in mind that some markets would win and lose even in the absence of any change in experience
- Program-level is not: Rolling up experiences to get to program-level results however does suffer from the MCP.
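To make the inflationary mechanism concrete, here is a minimal simulation sketch in Python (the test count, standard error and threshold are illustrative assumptions). It assumes the common practice of crediting only “significant winners” towards the program total: even when all 25 experiences truly do nothing, the handful of false positives all land on the plus side, so the summed number is consistently positive:

```python
import numpy as np

rng = np.random.default_rng(7)

N_TESTS = 25       # experiences being summed
SE = 50_000.0      # standard error of each experience's impact estimate, in £ (assumption)
Z_95 = 1.645       # one-sided 95% significance cut-off
N_SIMS = 10_000    # simulated measurement periods

winner_counts, summed_impacts = [], []
for _ in range(N_SIMS):
    # Every experience truly does nothing: observed impact is pure noise.
    observed = rng.normal(0.0, SE, N_TESTS)
    winners = observed[observed / SE > Z_95]   # the "significant" false positives
    winner_counts.append(len(winners))
    summed_impacts.append(winners.sum())       # only winners get credited

print(f"avg false-positive winners per period: {np.mean(winner_counts):.2f}")
print(f"avg summed 'impact' from pure noise: £{np.mean(summed_impacts):,.0f}")
```

At a one-sided 95% threshold you would expect roughly 25 × 5% ≈ 1.25 false winners per period, each contributing a decidedly positive “uplift” – the consistent inflation that the stricter thresholds below are meant to control.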
What experience-level significance threshold would we need to maintain a 5% overall false positive rate?
Using the Bonferroni method (the short sketch after this list reproduces these numbers):
- If summing 35 tests:
- 5% aggregate false positive rate requires 99.86% threshold
- 20% aggregate false positive rate requires 99.4% threshold
- 25 tests:
- 5% false positive rate requires 99.80% threshold
- 20% false positive rate requires 99.2% threshold
- 10 tests:
- 5% false positive rate requires 99.50% threshold
- 20% false positive rate requires 98% threshold
- 5 tests:
- 5% false positive rate requires 99.0% threshold
- 20% false positive rate requires 96% threshold
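These thresholds follow directly from the Bonferroni formula: with n tests and a desired aggregate false positive rate α, each individual test needs a confidence threshold of 1 − α/n. A minimal sketch to reproduce the numbers above (the function name is my own):

```python
def bonferroni_threshold(n_tests: int, aggregate_alpha: float) -> float:
    """Per-test confidence threshold so that the chance of at least one
    false positive across n_tests stays at aggregate_alpha."""
    return 1 - aggregate_alpha / n_tests

for n in (35, 25, 10, 5):
    for alpha in (0.05, 0.20):
        print(f"{n} tests, {alpha:.0%} aggregate false positive rate -> "
              f"{bonferroni_threshold(n, alpha):.2%} threshold")
```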
Summary
While the Overlapping Samples & Interaction Effects problems can lead to excessive volatility, only the Multiple Comparisons Problem can lead to consistently inflated win rates and conversion/revenue uplifts.
The MCP becomes relevant only when impact from multiple overlapping tests is summed.
- If, for example, we roll up multiple markets to get to global experience-level results, then (assuming markets are distinct websites) this is not a problem at the aggregate level. Statistically, the aggregate measurement here can unproblematically be considered a single global test, so the MCP is not relevant.
- (we just need to bear in mind that some markets would win and lose even in the absence of any change in experience).
- Rolling up experiences to get to program-level results however does suffer from the MCP. See solution options below.
Solution Options
Solution Option 1: structural reduction in # of tests
We could reduce the # of “tests” by combining those that are similar when summing.
For example:
- All 25 experiences are evaluated independently for optimisation purposes
- For Global impact measurement however, we treat each page as a “test”, making for 10 tests in total
Demo: looking at a hypothetical Zero Results example
- Based on Your Last Order = £500k incremental
- Because You Purchased X = £500k incremental
- Popular in Favourite Category = £500k incremental loss
- Contribution to Global Impact = £500k
This would work under current measurement practice.
Under the proposed measurement framework evolution however, approx. 50% of tests would be likely to fail, so these sums would sit close to zero.
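To quantify the benefit of this structural reduction, we can reuse the Bonferroni sketch from earlier (assuming a 5% aggregate false positive rate is still the target):

```python
def bonferroni_threshold(n_tests: int, aggregate_alpha: float) -> float:
    # Same helper as in the earlier sketch.
    return 1 - aggregate_alpha / n_tests

# Collapsing 25 experience-level tests into 10 page-level "tests" relaxes
# the per-test bar needed to hold a 5% aggregate false positive rate.
print(f"25 tests: {bonferroni_threshold(25, 0.05):.2%}")  # 99.80%
print(f"10 tests: {bonferroni_threshold(10, 0.05):.2%}")  # 99.50%
```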
Solution Option 2: build a dynamic Bonferroni correction into Aggregate Impact measurement [more detail needed]
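The detail here is still to be worked out, but purely as a sketch of one possible interpretation (the dataclass, function and confidence figures below are my own illustrative inventions, not a description of an existing framework): at aggregation time, count the live experiences, derive the Bonferroni-adjusted threshold from that count, and only credit experiences that clear it.

```python
from dataclasses import dataclass

@dataclass
class ExperienceResult:
    name: str
    impact_gbp: float    # estimated incremental revenue for this experience
    confidence: float    # e.g. 0.97 for a 97%-confident positive result

def aggregate_impact(results: list[ExperienceResult],
                     aggregate_alpha: float = 0.05) -> float:
    """Sum experience impacts, crediting only those that clear a Bonferroni-
    adjusted threshold derived from how many experiences are currently live."""
    threshold = 1 - aggregate_alpha / len(results)   # dynamic: moves with test count
    return sum(r.impact_gbp for r in results if r.confidence >= threshold)

# Illustrative only: with 3 live experiences the threshold is 1 - 0.05/3 ≈ 98.3%,
# so only the first result below gets credited.
results = [
    ExperienceResult("Based on Your Last Order", 500_000.0, 0.995),
    ExperienceResult("Because You Purchased X", 500_000.0, 0.96),
    ExperienceResult("Popular in Favourite Category", -500_000.0, 0.90),
]
print(aggregate_impact(results))   # 500000.0
```

Because the threshold is recomputed from the live test count, it tightens automatically as the program scales up and relaxes as experiences are retired.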
What do you think?
Have you come up against these statistical gremlins within your personalisation program? Do these solution options seem workable, or have you discovered alternative strategies?
Let me know!