Think about the largest and smallest [online purchase values] you see today, in [dollars]. Write those minimum and maximum amounts below.
_____________
Title of Card
What is the average [online purchase value] you see today, in [dollars]? Drag the dot there.
Now, think about the middle 70% of the [online purchase values] you see today, in [dollars] (the ones that aren't extreme in either direction). Drag the barbells so they shade over that 70%.
Finally, think about how you are hoping to shift those [online purchase values] with this experiment. What is the smallest shift you'd need to see in the experiment to feel confident rolling out the idea you're experimenting with more broadly? Drag the shaded area (the average and the 70%) to represent that minimum shift.
Because an experiment is sampling from a broader population, there is always a chance (even with the best random sampling methods) that we accidentally draw from some extreme set of people and see results that are different from the truth of that population. We could see a false positive (an effect in the sample that really isn't there in the population) or a false negative (no effect in the sample, even though there really is one in the population).

We won't know which results are false, of course. Which means with a false positive, we may move forward with an idea that really doesn't work. Or with a false negative, we may withhold an idea that really does work.

Think about this in the context of your hypothesis: [hypothesis]. Which would be worse for you, financially, PR-wise, politically, operationally...? A false positive, i.e., moving forward with an idea that really doesn't work? Or a false negative, i.e., withholding one that really does work? Drag the ball to indicate their relative risk.
You run the experiment, and you see an effect! You go tell your boss. But you need to remind your boss there is some % chance that the effect is just a false positive... What is the largest % chance of a false positive that you (and your boss) are comfortable with? Drag the bar to indicate it.
What is the absolute maximum amount of [units] you could use for your sample in this experiment? Consider how more [units] can often mean greater financial cost, greater time to implement, and / or greater risk.

Wizard: Power calculation for comparing proportions / averages

Is the number of (s) you can experiment with extremely large or unlimited? Or is it limited by budget, access, or other constraints?
Sensitivity power calculation for comparing proportions
Categorical Sensitivity Analysis Current Scenario
alpha (risk of false positives, or type I error rate): The likelihood of measuring a result at least as extreme as the one observed, when in fact there is no effect at all
help_outline
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured, even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
help_outline
Determining power asks you to live in a world where you've run the experiment and found a seemingly conclusive null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect the effect, even though there truly is one.
Total # of (s) available for experiment
Expected control group proportion that NEW QUESTION TODO (% success in control group)
help_outline
If we collected 100 samples of the DV from your control group, how many would meet the criteria of success as you define it? (e.g., if your DV is click through rate on an email, how many emails out of 100 do you expect to receive a click in your control group?)
Expected treatment group proportion that NEW QUESTION TODO (% success in treatment group)
help_outline
This value establishes the minimum proportion that your treatment group must achieve in order to produce a statistically significant result. E.g., treatment group expected proportions of 13% or 15% mean your intervention must produce a proportion of more than 15%, or less than 13%, to produce a statistically significant result.
-
Minimum Detectable Effect (MDE)
help_outline
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
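As a rough illustration of what this sensitivity calculation does, the sketch below (not the app's actual implementation; the control rate, alpha, power, and sample size are made-up example values) finds the minimum detectable lift in a proportion for a fixed per-group sample size, using the standard pooled normal-approximation formula and a bisection search:

```python
# Illustrative sketch only: sensitivity analysis for comparing two proportions.
from math import sqrt
from statistics import NormalDist

def n_per_group(p1: float, p2: float, alpha: float, power: float) -> float:
    """Required sample size per group (two-sided, pooled normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    num = z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (num / (p2 - p1)) ** 2

def mde_proportion(p1: float, n: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest detectable increase over p1 given n per group, via bisection."""
    lo, hi = 1e-6, 1 - p1 - 1e-6
    for _ in range(100):                        # required n shrinks as the lift grows
        mid = (lo + hi) / 2
        if n_per_group(p1, p1 + mid, alpha, power) > n:
            lo = mid                            # lift too small to detect with n
        else:
            hi = mid
    return hi

# E.g., with a 10% control rate and 1,000 units per group, the MDE is ~4 points
print(round(mde_proportion(0.10, 1000), 4))
```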
Sensitivity power calculation for comparing averages
Continuous Sensitivity Analysis Current Scenario
alpha (risk of false positives, or type I error rate): The likelihood of measuring a result at least as extreme as the one observed, when in fact there is no effect at all
help_outline
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured, even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
help_outline
Determining power asks you to live in a world where you've run the experiment and found a seemingly conclusive null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect the effect, even though there truly is one.
Total # of (s) available for experiment
Pooled standard deviation of DV: How variable or 'spread out' is the outcome (dependent) variable in both test & control group(s)?
help_outline
If you have no sample data to establish the SD, you can estimate the standard deviation with the following thought experiment: excluding outliers, subtract the minimum DV value you'd expect from the maximum DV value you'd expect, and divide by 6
Expected control group NEW QUESTION TODO average
help_outline
If we collected 100 samples of the DV from your control group and averaged them, what number would you estimate?
Expected treatment group NEW QUESTION TODO average
help_outline
This value establishes the minimum average that your treatment group must achieve in order to produce a statistically significant result. E.g., treatment group expected averages of 27 or 33 mean your intervention must produce an average of more than 33, or less than 27, to produce a statistically significant result.
-
Minimum Detectable Effect (MDE)
help_outline
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
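To make the continuous sensitivity calculation concrete, here is a hedged sketch (illustrative only; the DV range, sample size, alpha, and power are made-up example numbers, and this is not the app's code). It estimates the SD with the (max - min) / 6 range rule described above, then computes the minimum detectable effect for a fixed per-group sample size:

```python
# Illustrative sketch only: sensitivity analysis for comparing two averages.
from math import sqrt
from statistics import NormalDist

def mde_mean(sd: float, n: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """MDE = (z_alpha/2 + z_power) * sd * sqrt(2 / n), two-sided normal approx."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return (z_a + z_b) * sd * sqrt(2 / n)

# Range rule of thumb: SD ~ (expected max - expected min) / 6, outliers excluded
sd_estimate = (95.0 - 5.0) / 6   # hypothetical DV range -> SD estimate of 15.0
print(round(mde_mean(sd_estimate, 500), 2))
```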

Please note:

Test group expected proportion / average & MDE: Our power calculators assume you will run a two-sided test, meaning you're willing to consider the chance that the treatment may increase OR decrease the value of your primary outcome. The two values indicate a statistically significant average decrease and increase in the primary outcome relative to the expected control outcome.

A priori power calculation for comparing proportions
Categorical A-Priori Analysis Current Scenario
alpha (risk of false positives, or type I error rate): The likelihood of measuring a result at least as extreme as the one observed, when in fact there is no effect at all
help_outline
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured, even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
help_outline
Determining power asks you to live in a world where you've run the experiment and found a seemingly conclusive null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect the effect, even though there truly is one.
Expected control group proportion that NEW QUESTION TODO (% success in control group)
help_outline
If we collected 100 samples of the DV from your control group, how many would meet the criteria of success as you define it? (e.g., if your DV is click through rate on an email, how many emails out of 100 do you expect to receive a click in your control group?)
Expected treatment group proportion that NEW QUESTION TODO (% success in treatment group)
help_outline
This value establishes the minimum proportion that your treatment group must achieve in order to produce a statistically significant result. E.g., treatment group expected proportions of 13% or 15% mean your intervention must produce a proportion of more than 15%, or less than 13%, to produce a statistically significant result.
Minimum Detectable Effect (MDE)
help_outline
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
Sample size per comparison group -
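To illustrate the a priori calculation for proportions, here is a hedged sketch (not the app's actual code; the 10% control rate, 15% treatment rate, alpha, and power are example values) using the standard pooled normal-approximation formula for the per-group sample size:

```python
# Illustrative sketch only: a priori sample size for comparing two proportions.
from math import sqrt, ceil
from statistics import NormalDist

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group n to detect p1 vs p2 (two-sided, pooled normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2                       # pooled proportion under the null
    num = z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((num / (p2 - p1)) ** 2)

# E.g., detecting a lift from 10% to 15% with alpha .05 and power .8
print(n_per_group(0.10, 0.15))  # 686 per group
```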
A priori power calculation for comparing averages
Continuous A-Priori Analysis Current Scenario
alpha (risk of false positives, or type I error rate): The likelihood of measuring a result at least as extreme as the one observed, when in fact there is no effect at all
help_outline
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured, even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
help_outline
Determining power asks you to live in a world where you've run the experiment and found a seemingly conclusive null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect the effect, even though there truly is one.
Pooled standard deviation of DV: How variable or 'spread out' is the outcome (dependent) variable in both test & control group(s)?
help_outline
If you have no sample data to establish the SD, you can estimate the standard deviation with the following thought experiment: excluding outliers, subtract the minimum DV value you'd expect from the maximum DV value you'd expect, and divide by 6
Expected control group NEW QUESTION TODO average
help_outline
If we collected 100 samples of the DV from your control group and averaged them, what number would you estimate?
Expected treatment group NEW QUESTION TODO average
help_outline
This value establishes the minimum average that your treatment group must achieve in order to produce a statistically significant result. E.g., treatment group expected averages of 27 or 33 mean your intervention must produce an average of more than 33, or less than 27, to produce a statistically significant result.
Minimum Detectable Effect (MDE)
help_outline
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
Sample size per comparison group -
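And for the continuous case, a hedged sketch of the a priori calculation (illustrative only; the pooled SD, target lift, alpha, and power below are made-up example numbers) using the standard two-sample normal approximation:

```python
# Illustrative sketch only: a priori sample size for comparing two averages.
from math import ceil
from statistics import NormalDist

def n_per_group(sd: float, delta: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """n = 2 * (z_alpha/2 + z_power)^2 * sd^2 / delta^2 per group (two-sided)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2)

# E.g., pooled SD of 15, minimum detectable lift of 5, alpha .05, power .8
print(n_per_group(sd=15, delta=5))  # 142 per group
```

Note that halving the MDE roughly quadruples the required sample size, which is why tightening the treatment estimate matters so much.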
Business Experiment Launchpad

Running Experiments is Vital to Making Informed Business Decisions.

Launchpad helps you build, manage, and de-risk your experiments.

Jot Down Your Early Ideas

Lean on our battle-tested template to think through your experiment goals, risks, and benefits.

Don’t worry, it’s okay to leave answers blank on your first iteration.

Iterate on Your Answers

Share your draft with your teammates and iterate until you are happy with all your responses.

You can also print out a hard-copy for those train rides home.

Launch Your Experiment

You’ve answered the tough questions and built a design to maximize the ROI to your business.

Time to launch your bullet-proof experiment!

Who is Launchpad for?

We built this tool for busy professionals who want to improve their decisions with disciplined testing, but don't have training in statistics or data science.

If you own a brand, product, or process, you should be in the driver's seat when designing an experiment. With Launchpad, you're just a couple clicks away from science-driven business decisions.

Use Launchpad to play around with new ideas, collaborate with teammates, and record your thoughts over time.

Origin Story

Linnea first developed the Business Experiment Launchpad for her MBA students as they learned how to run randomized controlled experiments at the University of Chicago Booth School of Business.

Alongside Nobel Prize winner Richard Thaler, she developed and ran a unique year-long business lab class where MBA students partnered with outside companies to design, implement, and test behavioral 'nudges'. Some students were nervous about the statistics involved in experiments, others struggled to keep track of all the moving parts across design and deployment, and still others failed to anticipate risks and pitfalls along the way.

First on paper and then in software, Linnea built the Business Experiment Launchpad to support her students and eventually her own business clients as they ran tests. And now, she and her team are sharing it with you!

Got suggestions for ways we can improve Launchpad? Drop us a note at hello@behavioralsight.com!

Experiment #5 Last Edited 4 days ago

Goals: Why are you conducting research?
help_outline
What is the strategic importance and/or impact to the organization of rigorously studying this topic?
help_outline
What do you want to learn in this particular study?
Variables: What are you measuring?
help_outline
What change(s) are you going to make, that you think will in turn change the outcome in the world that you care about? E.g., the words in an email subject line, the color of a website button, the greeting a salesperson says when a customer walks in the door, etc.
help_outline
What outcome in the world do you care about, can measure, and want to try to influence with your idea? E.g., customers open our emails more frequently, customers purchase higher value products, employees stay with the company longer, etc.

Categorical: a frequency, probability, or percentage describing an action that is or isn't taken, a threshold that is or isn't met, or a label that is or isn't given, e.g., the % of customers who open a savings account in 2020, the % of customers who save over $5k a year, the % of customers who gave a satisfaction rating of 5 out of 5

Continuous: numerical data with a wide, continuous range of values like money, weight, or count, e.g., the average amount saved by customers, the average number of days an account stays open

Fill in the blank: an action or attribute of individuals that you want to compare, for instance "reported the highest satisfaction rating" or "clicked on the promotion."

Fill in the blank: an average numerical amount that you want to compare, for instance "customer satisfaction rating" or "number of clicks on our webpage."
help_outline
What other outcomes in the world do you care about, can measure, and want to try to influence with your idea? These will be used in exploratory analyses, while the "primary outcome" will be the focus of this experiment blueprint
help_outline
Through what medium are you running the experiment? E.g., email, call center, website, posters in a physical space, etc.
help_outline
Whose behavior are you trying to change? E.g., all customers, customers in a certain region, customers of a certain product, all employees, employees on a certain team, etc.
Hypothesis: What do you predict and why?
help_outline
Your hypothesis should be a statement with "if-then" logic (e.g., "if I change X, then Y will happen") and a counterfactual or control (e.g., "versus if I do NOT change X"). It should be specific (describe X, Y, and the direction of the change so that a colleague, as well as a stranger on the street, could understand), testable (you can make the changes to X that you desire, and you can measure the changes to Y that you expect), and falsifiable (there is at least some possibility that you could see different results than you predict)

Suggested Hypothesis from your prior responses:
"If we change , then we will see a difference in the average "

Suggested Hypothesis from your prior responses:
"If we change , then we will see a difference in the proportion that "


help_outline
What is the supporting rationale for your prediction? Your rationale could come from prior experiments (e.g., psychology studies in academia, other experiments you have run), prior non-experimental research (e.g., focus groups, interviews, surveys), user feedback (e.g., comments, complaints, anecdotes), external examples (e.g., competitor initiatives, ex-industry initiatives), or other sources. In any case, the source of your hypothesis should be clearly stated and go beyond mere gut feeling.
Design: What is your experiment setup?
help_outline
How many versions of the independent variable will you have? Will there be a "control" group that doesn't get any version, or gets a "plain" version? (e.g., 1 control + 2 'new' versions to test = 3 groups)
help_outline
What version of the independent variable will the control group experience? Nothing? Something 'vanilla'? The status quo?
help_outline
What version of the independent variable will test group 1 experience?
help_outline
For each additional test group, what version of the independent variable will they experience?
help_outline
Fill in the blank: if you are running an experiment with employees, there may be a risk that employees who work closely together may discover they received different experiences (e.g., one had the control and one had the treatment). To avoid this risk of “contamination”, you may want to randomize by work spaces, office floors, or office buildings.
Sampling Plan & Treatment Assignation: Who is participating in the experiment?
help_outline
Often we cannot run an experiment on the entire population we care about; we must pick a sub-set. Sometimes we cannot even run an experiment on anyone in the population we care about; we must pick a proxy. From where will you source your experiment participants (e.g., a sample of clients attending our annual conference, Amazon Mechanical Turk workers, customer emails)?
help_outline
It may be important to filter out, or select for, different qualities in order to ensure you zero-in on the 'right' (representative) population.
help_outline
Participants can be assigned to groups based on a variety of methods, but the assignment must be random.
help_outline
When determining your sample size, it's important to consider the tradeoffs of smaller vs. larger samples. Generally speaking, a larger sample means it's more likely that the results you're seeing in the experiment closely match the results you'd see if the experiment were run with the entire population. A larger sample also means it is more likely that if there is just a small effect, you will still be able to detect it with the experiment. However, a larger sample also often requires more time and money.
help_outline
Ideally participants will not know they are part of an experiment to avoid confounding the interpretation of results. If there is a risk they will, plan ahead to avoid or mitigate.
help_outline
When will you begin and end your experiment? Sometimes this timing is based on calendars; other times this timing is based on the expected length of time to sample enough participants. ("Enough" will be determined in the next part of the blueprint.)
Final Step

That's it! Save your work, and keep iterating as needed. Use this page as a blueprint to build your experiment.

Use the menu to invite teammates or print this experiment or save it as a PDF.

If you need help along the way, our team would be happy to assist you - just reach out to hello@behavioralsight.com