Testing Website Changes: P-values

Related Concepts

P-value, Hypothesis Testing, Null Hypothesis (H₀), Alternative Hypothesis (H₁), Statistical Significance, Significance Level (Alpha, α), A/B Testing, Type I Error (False Positive), Type II Error (False Negative), Business Decision Making, Communicating Technical Results, Confidence in Results

Hint

Think about how to explain this in terms of "strength of evidence" or "how surprising the results are."

Low p-value (0.03):
- This means if there was NO real difference between Version A and B, we'd only expect to see results this extreme (or more so) about 3% of the time just by chance.
- Is 3% a low chance or a high chance? How does this affect our confidence that the difference is real?
- What decision does this typically lead to?
High p-value (0.3):
- This means if there was NO real difference, we'd expect to see results this extreme (or more so) about 30% of the time just by chance.
- Is 30% a low chance or a high chance? How does this affect our confidence?
- What decision does this typically lead to?
Consider using an analogy, like a courtroom: Is there enough evidence to convict (i.e., say there's a real difference)?
Relate it to a common decision threshold (like 5% or p=0.05).

Explanation: P-values for Stakeholders in A/B Testing

"Hi [Stakeholder Name], we've just got the results from our A/B test comparing the new webpage (Version B) against our current one (Version A). To help us decide if Version B is truly better, we look at a number called a 'p-value'. Let me explain what that means for us."

"Think of it this way: we start by assuming that there's actually no real difference in performance between Version A and Version B. The p-value then tells us: if that assumption of 'no difference' is true, what's the chance that we'd see the results we got (or something even more different) just by random luck?"
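For the analyst preparing this conversation, the "random luck" idea can be made concrete with a small simulation. The sketch below is illustrative only: the function name, the 10% shared conversion rate, the 500 visitors per version, and the 4-point observed gap are all assumed numbers, not results from a real test. It replays many imaginary tests in which both versions truly perform the same and counts how often chance alone produces a gap at least as large as the one observed.

```python
import random


def simulated_p_value(observed_diff, visitors_per_version, shared_rate,
                      n_simulations=2000, seed=0):
    """Estimate the p-value by simulating tests where A and B are truly identical.

    Returns the share of simulated tests whose gap in conversion rate is at
    least as large (in either direction) as observed_diff.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    at_least_as_extreme = 0
    for _ in range(n_simulations):
        # Both versions convert at the same true rate: the "no difference" assumption.
        conv_a = sum(rng.random() < shared_rate for _ in range(visitors_per_version))
        conv_b = sum(rng.random() < shared_rate for _ in range(visitors_per_version))
        simulated_diff = abs(conv_b - conv_a) / visitors_per_version
        if simulated_diff >= abs(observed_diff):
            at_least_as_extreme += 1
    return at_least_as_extreme / n_simulations


# Hypothetical inputs: 500 visitors per version, 10% baseline, 4-point observed gap.
print(simulated_p_value(observed_diff=0.04, visitors_per_version=500, shared_rate=0.10))
```

The printed fraction plays the role of the p-value: the smaller it is, the more surprising the observed gap would be if the two pages were really equal.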

If we got a p-value of 0.03 (or 3%):

"A p-value of 0.03 is like saying: 'If Version A and Version B were actually performing exactly the same, there would only be a 3% chance that we'd see a difference as big as the one we observed in our test, just due to random variations in who saw which page.'"

What this means for our decision: A 3% chance is quite small. If the pages were truly equal, results like ours would be unusual. This gives us strong evidence that the difference we saw is real and that one version is genuinely performing differently from the other (hopefully Version B, if that's what the data showed!).
Our confidence level: We can be pretty confident that the "winning" version in our test is actually better and not just a result of luck.
Typical action: This is a green light. If Version B showed better results (e.g., more sign-ups) and we have this low p-value, we'd generally recommend moving forward with Version B because the data supports that it's a real improvement. We usually look for p-values less than 0.05 (or 5%) to feel this confident.
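Behind the scenes, a p-value like this is often computed with a two-proportion z-test. The following is a minimal sketch of that one common approach, not necessarily the method used for any particular test. The function name and the visitor and sign-up counts are made up for illustration, and the normal approximation it relies on assumes reasonably large samples.

```python
import math

ALPHA = 0.05  # the common 5% threshold mentioned above


def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for 'A and B convert at the same rate' (pooled z-test)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Under the "no difference" assumption, both versions share one pooled rate.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Chance of a gap at least this large in either direction under a normal curve.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 200 of 2,000 sign-ups on A, 245 of 2,000 on B.
p = two_proportion_p_value(200, 2000, 245, 2000)
print(f"p-value: {p:.3f}")
print("statistically significant" if p < ALPHA else "not statistically significant")
```

With these assumed counts the p-value lands at roughly 0.02, which is below the 0.05 threshold and so falls in the "green light" case.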

If we got a p-value of 0.3 (or 30%):

"A p-value of 0.3 is like saying: 'If Version A and Version B were actually performing exactly the same, there would be a 30% chance that we'd see a difference as big as the one we observed in our test, just due to random variations.'"

What this means for our decision: A 30% chance is relatively high. It means that even if there's no real underlying difference between the pages, we could easily get results like these just by chance. This doesn't give us strong evidence to say that one version is truly better than the other. The difference we saw could very well be a fluke.
Our confidence level: We can't be very confident that the "winning" version in our test is genuinely superior. The results are inconclusive.
Typical action: This is more like a yellow or red light. We wouldn't confidently launch the "winning" version based on this. Our options would be:
- Stick with Version A (if it's our current, stable version).
- Consider if the test needs to run longer to gather more data, especially if the observed difference, while not statistically strong, looked promising or was large in practical terms.
- Conclude that, based on this test, we don't have enough evidence to say Version B is better, and we might need to explore other ideas for improvement.
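The option of gathering more data can be illustrated with a short sketch. It assumes, purely for illustration, that the observed rates stay at 10% for Version A and 11% for Version B as the test grows, which real data will not guarantee. The function name and all numbers are hypothetical, and the calculation uses a normal approximation with equal group sizes.

```python
import math


def p_value_for_rates(rate_a, rate_b, visitors_per_version):
    """Two-sided p-value (normal approximation) for two equal-sized groups."""
    pooled = (rate_a + rate_b) / 2  # equal group sizes, so a simple average
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / visitors_per_version)
    z = (rate_b - rate_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# Same observed gap (10% vs 11%), increasingly large hypothetical tests.
for visitors in (1_000, 5_000, 20_000):
    p = p_value_for_rates(0.10, 0.11, visitors)
    print(f"{visitors:>6} visitors per version -> p = {p:.3f}")
```

The same one-point gap is inconclusive with a small test and becomes much harder to attribute to chance with a large one, which is why a promising but non-significant result can justify running the test longer.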

The Key Takeaway for Decision Making:

"We generally use a threshold, often 0.05 (or 5%). If the p-value is below this threshold (like 0.03), we say the result is 'statistically significant,' meaning we have good evidence that the difference is real. If it's above this threshold (like 0.3), the result is 'not statistically significant,' and we can't confidently say the difference isn't just due to chance."

In a nutshell:

P-value = 0.03: "The results look strong! Version B is likely genuinely better. Let's consider rolling it out."
P-value = 0.30: "The results are not clear enough. The difference we saw could easily be random chance. We shouldn't make a change based on this alone."

A Quick Note: The p-value doesn't tell us how big the difference is, only how likely we are to see a difference of that size (or larger) if there was no real underlying difference. So, even with a low p-value, we also need to look at the actual size of the improvement (e.g., did Version B increase conversions by 20% or just 0.2%?) to decide if it's a practically significant change worth implementing for the business.
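One way to keep both questions in view is to check statistical and practical significance side by side. This is only a sketch: the 5% minimum worthwhile lift is a placeholder that the business would have to choose, and the function names are invented for this example.

```python
MIN_WORTHWHILE_LIFT = 0.05  # hypothetical: smallest relative improvement worth shipping


def relative_lift(rate_a, rate_b):
    """Relative improvement of B over A, e.g. 0.20 means B converts 20% better."""
    return (rate_b - rate_a) / rate_a


def worth_implementing(rate_a, rate_b, p_value, alpha=0.05):
    """True only if the result is statistically AND practically significant."""
    statistically_significant = p_value < alpha
    practically_significant = relative_lift(rate_a, rate_b) >= MIN_WORTHWHILE_LIFT
    return statistically_significant and practically_significant


# Same low p-value, very different business value (all numbers are made up).
print(worth_implementing(0.10, 0.12, p_value=0.03))    # 20% relative lift
print(worth_implementing(0.10, 0.1002, p_value=0.03))  # 0.2% relative lift
```

The first case passes both checks; the second has the same p-value but an improvement too small to matter under the assumed threshold.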
