Guide · March 9, 2026 · 14 min read

App Store A/B testing for indie developers: what works and what doesn't

Both Apple and Google give you tools to test your store listing. Apple calls it Product Page Optimization (PPO). Google calls it Store Listing Experiments. In theory, you change your screenshots or icon, split traffic between variants, and pick the winner. In practice, if your app gets 200 impressions a week, you'll be waiting a long time for a statistically meaningful answer. Here's how to make these tools actually useful when you don't have enterprise-level traffic.

What you can actually test

The two platforms give you different testing surfaces. Apple's PPO lets you test up to three treatments against your original for app icon, screenshots, and app previews (videos). You can run one test at a time, and each treatment can change one or more of those elements. Google's Store Listing Experiments is more flexible: you can test the icon, feature graphic, screenshots, short description, and long description. Google also lets you run multiple experiments simultaneously.

One thing to note: Apple's icon test requires you to submit the variant icons with a new app binary. So if you want to test a new icon on iOS, you need to plan it as part of an app update. Screenshots and previews don't require a new build.

On Google Play, everything is decoupled from your APK/AAB. You can change and test any listing element whenever you want. This makes iteration much faster on Android.

The traffic problem nobody talks about

Here's the math that most A/B testing guides skip. To detect a 5% improvement in conversion rate with 95% confidence and 80% power, you need roughly 15,000 impressions per variant. For a 10% improvement, you need about 4,000. For a 20% improvement, around 1,000.

Most indie apps get somewhere between 500 and 5,000 impressions per week. If you're testing two variants and need 4,000 impressions each, that's 8,000 total. At 1,000 impressions a week, you're looking at eight weeks for a single test. And that only catches a 10% lift. If your change produces a smaller improvement, you'll either miss it or get a false result.
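If you want to rerun this math with your own numbers, here's a minimal sketch of the standard two-proportion sample-size formula. One assumption to flag: the figures above line up with a baseline conversion rate around 30%, which is only an illustrative value; swap in the rate your own analytics show.

```python
# Rough impressions needed per variant for a two-sided two-proportion test.
# Assumptions: 95% confidence (alpha = 0.05), 80% power, and a ~30% baseline
# conversion rate -- an illustrative value; use your own from your analytics.
from scipy.stats import norm

def impressions_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)      # ~1.96 for 95% confidence
    z_power = norm.ppf(power)              # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return round((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.05, 0.10, 0.20):
    n = impressions_per_variant(baseline=0.30, relative_lift=lift)
    print(f"{lift:.0%} lift: ~{n:,} impressions per variant")
# Prints roughly 15,000 / 3,800 / 950 per variant, the same ballpark as above.
```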

This doesn't mean you shouldn't test. It means you should test big changes, not subtle ones. Swapping your entire screenshot sequence for a different approach has a chance of producing a 20%+ difference. Changing the font size on your screenshot captions probably won't move the needle enough to measure.

What to test first

If you're going to run one test, make it your first screenshot. On the App Store search results page, most people see only the first two or three screenshots before deciding whether to tap. The first screenshot is your headline. It determines whether anyone looks at anything else.

The second-highest-impact thing to test is your icon. It appears everywhere: search results, charts, recommendations, the home screen. A better icon improves tap-through rates across every surface. But remember that on Apple you need a new binary to test icons, so it's more effort.

Here's an ordering that makes sense for most indie apps:

  1. First screenshot (biggest impact on browse-to-detail conversion)
  2. Screenshot sequence and ordering (the full story you tell)
  3. App icon (affects every impression, but harder to test on iOS)
  4. App preview video (if you have one; many people scroll past)
  5. Descriptions (rarely read on iOS; somewhat more on Google Play)

For descriptions specifically: on iOS, almost nobody taps "more" to expand the description. Your conversion rate is determined by what's visible without tapping. On Google Play, the short description gets more attention, so testing that is reasonable.

How to design a test that actually tells you something

The biggest mistake people make is testing too many things at once. If your treatment changes the screenshot layout, the caption text, the color scheme, and the order, you won't know which change mattered. Keep it to one variable per test when possible.

That said, with low traffic, you sometimes need to make a bolder move to get a measurable signal. In that case, test two genuinely different approaches rather than small variations. For example: test a screenshot set that leads with a feature close-up against one that leads with a lifestyle shot. That's a bigger difference than "blue background vs. green background."

Before you start, write down what you expect to happen and why. "I think the feature-first screenshots will convert better because our app is utilitarian and people search for specific functionality." This sounds like extra work, but it prevents you from rationalizing whatever result you get after the fact.

Set a minimum duration before you look at results. Two weeks is reasonable for most indie apps. Apple recommends running PPO tests for at least seven days to account for weekly patterns, but the statistical significance indicator in App Store Connect is what you should actually watch. If it says you need more data, you need more data, even if it's been three weeks.

Reading the results without fooling yourself

Apple's PPO shows you the improvement range with a confidence level. If the range spans zero (like -3% to +8%), the test is inconclusive. It doesn't mean there's no difference; it means you don't have enough data to tell. Running the test longer might help, or the real difference might be too small to measure at your traffic level.

Google shows you a similar confidence indicator. They'll tell you when a result has reached 90% confidence. For indie apps, I'd treat anything below 90% as "not enough data yet" rather than a definitive answer.
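If you want to sanity-check what either dashboard is telling you, here's a rough sketch of the underlying idea: a confidence interval for the difference between two conversion rates. The impression and download counts are made-up illustrative numbers, and Apple reports a relative improvement range rather than this absolute difference, but "spans zero means inconclusive" works the same way in both.

```python
# Rough 95% confidence interval for the difference between two conversion rates.
# All counts below are made-up illustrative numbers, not real store data.
from math import sqrt
from scipy.stats import norm

def conversion_diff_interval(downloads_a: int, impressions_a: int,
                             downloads_b: int, impressions_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    p_a = downloads_a / impressions_a
    p_b = downloads_b / impressions_b
    se = sqrt(p_a * (1 - p_a) / impressions_a + p_b * (1 - p_b) / impressions_b)
    z = norm.ppf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = conversion_diff_interval(downloads_a=90, impressions_a=3_000,    # control: 3.0%
                                     downloads_b=105, impressions_b=3_000)   # treatment: 3.5%
print(f"difference: {low:+.1%} to {high:+.1%}")
# Prints roughly -0.4% to +1.4%. The range spans zero, so at this traffic level
# the test is inconclusive, even though the treatment looks better on paper.
```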

Watch out for these traps:

  • Stopping early because you like the direction. Wait for the statistical significance threshold. Early trends reverse more often than you'd think.
  • Ignoring seasonality. If you start a test right before a holiday and end it after, external factors may explain your results more than the change did.
  • Fixating on conversion rate without checking volume. A treatment might convert better but attract fewer impressions (possible if Apple's algorithm factors in engagement). Look at both rate and absolute numbers.
  • Running the same test repeatedly hoping for a different outcome. If a test was inconclusive, the change probably doesn't matter much. Move on to testing something bigger.

Apple PPO specifics

Product Page Optimization launched with iOS 15 and has improved since then. Here's what you need to know:

You get up to three treatments plus the original (control). Each treatment can have a different icon, different screenshots, different preview video, or any combination. Traffic is split evenly by default, but you can adjust the allocation.

For indie devs, I recommend two variants at most (one treatment plus the original). With three or four variants, you're splitting already limited traffic further, and it takes much longer to reach significance. At 500 impressions per week split three or four ways, you can be looking at months for anything but the largest effects.
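Here's a back-of-the-envelope sketch of why each extra treatment hurts so much at indie traffic levels. The 1,000-impressions-per-variant figure is the rough threshold for detecting a 20% lift from earlier in this post; the 500 weekly impressions are illustrative.

```python
# Weeks until each variant has collected enough impressions, assuming traffic
# is split evenly across all variants (control included).
# 1,000 impressions per variant ~= enough to detect a ~20% lift (see above);
# smaller lifts need several times more.
NEEDED_PER_VARIANT = 1_000

def weeks_to_significance(weekly_impressions: int, num_variants: int) -> float:
    per_variant_per_week = weekly_impressions / num_variants
    return NEEDED_PER_VARIANT / per_variant_per_week

for variants in (2, 3, 4):
    weeks = weeks_to_significance(weekly_impressions=500, num_variants=variants)
    print(f"{variants} variants at 500 impressions/week: ~{weeks:.0f} weeks")
# 2 variants: ~4 weeks, 3 variants: ~6 weeks, 4 variants: ~8 weeks, and only
# for a 20% lift; a 10% lift at the same traffic stretches into several months.
```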

One useful thing about PPO: it only counts organic App Store traffic. If you're running marketing campaigns that send traffic directly to your listing, that traffic doesn't pollute the test. Clean data is one area where Apple's more controlled approach actually helps.

You can also use Custom Product Pages (CPPs) separately from PPO. CPPs are landing pages for specific audiences (one for a fitness campaign, another for a health campaign), each with unique screenshots and text. They're not A/B tests; they're targeted pages you link to from ads or marketing channels. But you can use insights from PPO to decide what goes on each CPP.

Google Play Store Listing Experiments

Google's system is more flexible. You can run experiments on the default listing or on custom store listings (their equivalent of CPPs). You can test graphics, text, or both. And you can run multiple experiments at once as long as they test different elements.

Google also gives you localized experiments. If your app is available in multiple countries, you can run a screenshot test for just your Japanese listing without affecting your US listing. This is useful if you've localized for specific markets and want to optimize each separately.

One thing Google has that Apple doesn't: you can test your short description and long description. On Google Play, more people actually read the description than on Apple's App Store, so this test can produce meaningful results. Test your first line (the part visible before "Read more") rather than burying changes in paragraph four.

Testing when you don't have enough traffic

If your app gets fewer than 500 impressions per week, formal A/B testing isn't going to give you reliable results in any reasonable time frame. That doesn't mean you can't improve your listing. It means you need different methods.

One approach: look at what's working for competitors. The competitor analysis method I wrote about earlier works here. If the top five apps in your category all lead with a similar style of screenshot, that's a data point. It's not an A/B test, but it's informed by thousands of data points (their downloads and conversion rates).

Another approach: qualitative testing. Show your two screenshot options to 10 people in your target audience and ask them what each app does and whether they'd download it. This takes an afternoon and gives you signal that months of low-traffic A/B testing wouldn't.

You can also use the "just ship it" method. Make the change, track your conversion rate in App Store Connect analytics for two weeks before and two weeks after, and compare. This isn't scientifically rigorous because other variables change (seasonality, chart position, featuring), but if your conversion rate jumps 30%, you can be reasonably confident the change helped.
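Here's a minimal sketch of that before/after comparison, assuming you've pulled the impression and download counts for each two-week window out of your store analytics by hand. The numbers are made up, and since the two windows aren't randomized, treat the output as a strong hint rather than proof.

```python
# Crude before/after check for the "just ship it" method.
# The counts are made-up illustrative numbers copied by hand from analytics.
before = {"impressions": 2_000, "downloads": 60}   # two weeks before the change
after = {"impressions": 2_100, "downloads": 82}    # two weeks after the change

rate_before = before["downloads"] / before["impressions"]
rate_after = after["downloads"] / after["impressions"]
lift = rate_after / rate_before - 1

print(f"before: {rate_before:.1%}  after: {rate_after:.1%}  lift: {lift:+.0%}")
# before: 3.0%  after: 3.9%  lift: +30%, big enough to be a credible signal,
# but remember that chart position, featuring, and seasonality also moved.
```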

The worst option is doing nothing because you don't have enough traffic for a proper test. Make your best guess, ship it, and iterate. An imperfect improvement process beats a perfect one that never starts.

What good results actually look like

In my experience looking at indie app data, here are rough benchmarks for what's achievable:

  • Changing your first screenshot from a feature list to a value-oriented message: 10-30% conversion lift
  • Redesigning all screenshots with consistent branding: 15-25% lift
  • Changing your icon: anywhere from -10% to +20% (icons are unpredictable; sometimes your gut feeling is wrong)
  • Adding a preview video: usually 5-15% lift, but varies a lot by category
  • Changing description text on iOS: typically unmeasurable (because nobody reads it)
  • Changing short description on Google Play: 3-8% lift if you nail the first line

The biggest wins almost always come from screenshots. If you've never optimized your screenshots, start there before touching anything else.

A testing calendar for one year

If you have moderate traffic (1,000-5,000 weekly impressions), here's a reasonable testing plan:

Months 1-2: Test your first screenshot. Two variants, big difference in approach (feature-first vs. benefit-first, or dark background vs. light). Apply the winner.

Months 3-4: Test your screenshot sequence. Keep the winning first screenshot and test different orderings for screenshots 2-5. Or test a completely different story structure.

Months 5-6: Test your icon. Prepare two or three icon variants with an app update. Make them genuinely different (shape, color, or concept), not minor tweaks.

Months 7-8: If you have an app preview video, test it against no video, or test two different videos. If you don't have one, consider whether your category is one where videos help (games: yes; productivity tools: sometimes; utilities: rarely).

Months 9-12: Re-test screenshots with new learnings. Your understanding of what resonates with users has improved. Apply everything you've learned about ASO and try again.

Between tests, don't let the listing sit unchanged. Use the downtime to study competitor listings and prepare your next hypothesis.

Mistakes I keep seeing

Testing color variations. Unless your current color actively repels people, color differences produce tiny effects that require massive traffic to measure. A solo developer testing blue vs. green backgrounds is burning months for nothing.

Not having a hypothesis. "I want to see which is better" is not a hypothesis. "I think showing the app in dark mode as the first screenshot will convert better because 70% of our users have dark mode enabled" is one. The hypothesis tells you what to do if the test is inconclusive.

Testing when you should be fixing. If your analytics show a 1% conversion rate on your store listing, you don't need a test. You need new screenshots. An A/B test is for choosing between two reasonable options, not for confirming that bad creative is bad.

Forgetting about localization. If 40% of your traffic comes from Japan, your test results are blended across all markets. A screenshot that resonates in the US might hurt conversions in Japan. Consider running localized experiments on Google Play or creating localized screenshots for each market.

Over-optimizing the listing while ignoring the product. If users download but immediately uninstall, your retention problem is more important than your conversion rate. A beautiful listing that attracts the wrong users is worse than an average listing that sets honest expectations.

When A/B testing isn't worth your time

I want to be honest about this: if your app is brand new with under 200 weekly impressions, your time is better spent on marketing and distribution than on optimizing conversion. Getting from 200 to 2,000 impressions through better ASO and keyword targeting will produce more downloads than a 15% conversion improvement on 200 impressions.

At 200 impressions with a 3% conversion rate, that's 6 downloads per week. A 15% improvement gets you to 6.9. Getting to 2,000 impressions at the same 3% rate gives you 60 downloads. The math is obvious, but I keep seeing indie devs spend weeks tweaking screenshots for an audience that barely exists yet.

The same logic applies to very mature apps with already-high conversion rates. If you're converting at 35%, there's less room to improve. Your effort is better spent on reducing churn or increasing revenue per user.

Where A/B testing fits in the bigger picture

A/B testing your store listing is one piece of the conversion puzzle. The other pieces are your keyword strategy (which determines who sees your listing), your reviews and ratings (which influence the decision after someone reads your listing), and your paywall (which determines whether a download becomes revenue).

Think of it as a funnel: impressions (ASO keywords) → listing views (icon appeal) → detail page views (first screenshot) → downloads (full listing) → engagement (onboarding) → revenue (paywall). Testing your store listing optimizes the middle of this funnel. Don't neglect the top (getting seen) or the bottom (keeping people).
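To make that funnel concrete, here's a toy model with made-up rates. Every number in it is an illustrative assumption; the point is that a store listing test only moves the tap-through and page-conversion stages, while keywords feed the top and the paywall sits at the bottom.

```python
# Toy funnel model with illustrative, made-up rates.
def monthly_revenue(impressions: int, tap_through: float, page_conversion: float,
                    onboarding_retention: float, paywall_conversion: float,
                    revenue_per_subscriber: float) -> float:
    downloads = impressions * tap_through * page_conversion
    subscribers = downloads * onboarding_retention * paywall_conversion
    return subscribers * revenue_per_subscriber

base = monthly_revenue(impressions=10_000, tap_through=0.40, page_conversion=0.30,
                       onboarding_retention=0.50, paywall_conversion=0.05,
                       revenue_per_subscriber=30.0)
better_listing = monthly_revenue(impressions=10_000, tap_through=0.40, page_conversion=0.36,
                                 onboarding_retention=0.50, paywall_conversion=0.05,
                                 revenue_per_subscriber=30.0)
print(f"baseline: ${base:,.0f}/month, with a 20% listing lift: ${better_listing:,.0f}/month")
# A 20% lift in page conversion lifts revenue 20%, but so would 20% more
# impressions from keywords or a 20% better paywall; the whole funnel multiplies.
```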

If you haven't set up basic analytics yet, do that first. You need to know your current conversion rate, where your traffic comes from, and what your retention looks like before testing makes any sense. Testing without baseline data is just guessing with extra steps.

Set up your first test in 30 minutes

On Apple:

  1. Open App Store Connect → your app → Product Page Optimization
  2. Create a new test. Name it something descriptive ("Benefit-first screenshots Mar 2026")
  3. Add one treatment. Upload your variant screenshots
  4. Set traffic allocation to 50/50
  5. Start the test and don't check it for at least seven days

On Google Play:

  1. Open Google Play Console → your app → Store Listing Experiments
  2. Choose "Default graphics" or "Default description"
  3. Add your variant assets
  4. Set the audience percentage (50% is fine for indie apps)
  5. Launch and let it run until Google indicates confidence

The preparation is where the time goes: designing alternative screenshots, picking the right hypothesis, deciding what change is bold enough to measure. The actual setup takes minutes.

Start with something you've been uncertain about. If you've wondered whether your screenshot captions are too long, or whether leading with a different feature would work better, test it. The worst outcome is learning something. And if you're using our scanner data to find opportunities, a well-optimized listing makes every opportunity worth more.