OpenAI on Mobile Development, Previews, & Snapshot Testing

December 10, 2024 by

Eric Horacek, Anton Tananaev, & Josh Cohenzadeh

Background

OpenAI's ChatGPT is one of the most popular and fastest-growing apps in the world. And from the start, OpenAI has utilized Emerge Tools Snapshots to help ensure they ship a top-tier, bug-free UI on iOS, Android and macOS.

OpenAI Snapshot images generated by Emerge (monthly)

Using Emerge's Snapshots means OpenAI hasn't written any visual snapshot tests for their mobile apps. How? Emerge takes advantage of existing Xcode and Android Studio previews and automatically converts them into tests.

We talked with Eric Horacek (iOS) and Anton Tananaev (Android) to learn more about OpenAI's approach to mobile development, testing, and how they use Snapshots.

Questions

Q1. How does the OpenAI team maintain design consistency across platforms?

ChatGPT is still a young product with rapidly evolving features. A good example is the message composer, which gets refined every few months and updated with new functionality. As a result, we need a large cross-platform design system flexible enough to meet the requirements of ChatGPT six months to a year from now.

Cross-platform design consistency is therefore something we spend a lot of time thinking about. We build on top of shared components and basic design tokens to keep things consistent, and we are continuing to invest in a formalized set of cross-platform design tokens (colors, fonts, spacing, etc.) as another step in this direction.

Q2. Could you tell us a bit about your testing philosophy for the ChatGPT apps? Where does snapshot testing fit in?

Generally, we follow a tiered approach to testing, often referred to as the 'testing pyramid'. This means we primarily use unit tests to validate logic in isolation, followed by integration tests to ensure components work together effectively, and finally end-to-end tests along with manual QA to validate the entire system. We also use snapshot tests in our native apps, which add visual regression coverage that complements the unit tests.

Additionally, we use textual snapshot tests, which validate that systems produce the expected output for a given input by using snapshot files saved to disk that are initially 'recorded' and updated when expected outputs change.
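The record-then-compare flow described above can be sketched in a few lines of Swift. This is only an illustration, not OpenAI's tooling: the `checkTextSnapshot` function, the `SnapshotResult` type, and the one-file-per-snapshot `.txt` layout are all assumptions made for the example.

```swift
import Foundation

/// Outcome of comparing a value against its stored snapshot (hypothetical type).
enum SnapshotResult: Equatable {
    case recorded
    case matched
    case mismatched(expected: String, actual: String)
}

/// Compares `actual` with the snapshot stored at `<directory>/<name>.txt`.
/// The snapshot is written when the file is missing or when `record` is true;
/// otherwise the stored text is read back and compared.
func checkTextSnapshot(
    _ actual: String,
    named name: String,
    in directory: URL,
    record: Bool = false
) throws -> SnapshotResult {
    let fileURL = directory.appendingPathComponent("\(name).txt")
    let fileExists = FileManager.default.fileExists(atPath: fileURL.path)

    if record || !fileExists {
        // First run, or an intentional update of the expected output.
        try FileManager.default.createDirectory(
            at: directory,
            withIntermediateDirectories: true
        )
        try actual.write(to: fileURL, atomically: true, encoding: .utf8)
        return .recorded
    }

    let expected = try String(contentsOf: fileURL, encoding: .utf8)
    return expected == actual
        ? .matched
        : .mismatched(expected: expected, actual: actual)
}
```

In a sketch like this, a test would fail on `.mismatched`, and an engineer would rerun with `record: true` once the new output has been reviewed and accepted.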

Q3. How do previews factor into your development workflow?

We’re big fans of previews. IDEs like Xcode and Android Studio offer them as a way to iterate rapidly on UI: we can view live previews of a component in different states while working on it.

With much of our native apps written in SwiftUI and Jetpack Compose, we adopt a preview-first approach—building about 80% of the UI in the preview and finalizing the remaining 20% after integrating it into the full application.

(Emerge is also a big believer in previews; you can check out our blog post on Preview Driven Development to learn about some of the benefits.)

Q4. You adopted snapshot testing early in the development of the ChatGPT app. Why was it important for the ChatGPT app to have robust snapshot testing from the beginning?

UI snapshot tests really shine when you’re making many changes to existing components, as they help you identify when you’ve introduced accidental regressions in areas you didn’t necessarily test while working on the feature (e.g., dark mode, VoiceOver, etc.).

For example, when redesigning the message composer, snapshot tests were crucial to ensure the previous design didn't regress while we were adding the new one, as the app needed to support both designs at once. Because ChatGPT changes rapidly, ships weekly, and reaches a large number of users, it is paramount that we don’t ship regressions to existing functionality, and that our new features work well out of the gate.

Snapshots of message composer

Visual regression testing via snapshot tests helps us feel much more confident that our changes are not breaking existing functionality in unexpected ways and that our new features behave correctly in all environments, even if they weren’t tested directly by the engineer or QA during the development process.

Plus, since we already relied on previews for development, it was basically zero developer effort to add Emerge's Snapshots.

Q5. The ChatGPT app features complex UI components (chat interface, code blocks, image generation). How do you handle testing these interactions with snapshots?

In our snapshot tests, we focus on exercising all notable states of a UI component or screen, and then render each of those states across as many environments as possible, such as dark mode, with accessibility overlays, on different device layouts like iPad/landscape iPhone/Mac, with large dynamic type sizes, and so on.

For example, we might render our code block UI component in states like a short code snippet, a snippet with a few long lines that cause it to overflow or wrap, snippets with and without syntax highlighting, and so on. Then, we render each of these states in all the environments mentioned earlier.
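The "every state in every environment" idea amounts to a cross product. The Swift sketch below enumerates one; the state and environment names are hypothetical stand-ins chosen for illustration and do not reflect the actual list OpenAI tests.

```swift
/// Hypothetical states of a code block component.
enum CodeBlockState: String, CaseIterable {
    case shortSnippet = "Short Code"
    case longLines = "Long Code"
    case noHighlighting = "No Syntax Highlighting"
}

/// Hypothetical rendering environments.
enum RenderEnvironment: String, CaseIterable {
    case light = "Light"
    case dark = "Dark"
    case landscape = "Landscape"
    case largeDynamicType = "Large Dynamic Type"
    case accessibilityOverlay = "Accessibility Overlay"
}

/// One snapshot to render: a component state in a given environment.
struct SnapshotVariant: Hashable {
    let state: CodeBlockState
    let environment: RenderEnvironment

    var name: String { "\(state.rawValue) - \(environment.rawValue)" }
}

/// Pairs every state with every environment.
func allVariants() -> [SnapshotVariant] {
    CodeBlockState.allCases.flatMap { state -> [SnapshotVariant] in
        RenderEnvironment.allCases.map { environment in
            SnapshotVariant(state: state, environment: environment)
        }
    }
}
```

With three states and five environments this yields fifteen variants, which shows how a handful of states quickly grows into dozens of snapshots per component.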

To provide some color, the OpenAI team has 36 different variants of their `MarkdownCodeBlock` that are tested for every build.

Variants of "Short Code"

Variant of "Long Code"

This approach gives us comprehensive test coverage of our UI components in many environments for everything except user interactions and state changes, which are generally tested through other approaches. To do this in code, we use a PreviewVariants abstraction, which makes it easy to exercise these combinations.

For both iOS/macOS and Android, using previews for snapshot tests makes it easy to produce multiple component variants. This Swift example turns any preview into a list of variants, including dark mode, landscape, and accessibility (a11y). You can use it like this:

// PreviewVariants and previewVariant(named:) are OpenAI's own abstraction;
// their implementation is not shown here.
PreviewVariants(layout: .sizeThatFits) {
    MyView(mode: .loaded)
        .previewVariant(named: "My View - Loaded")

    MyView(mode: .loading)
        .previewVariant(named: "My View - Loading")

    MyView(mode: .error)
        .previewVariant(named: "My View - Error")
}

Snapshot with landscape and dark mode variant

Snapshot with accessibility overlay

Q6. The ChatGPT app supports a large number of languages. How do you manage this with snapshot testing, especially given the complexity of multilingual UI?

For internationalization of ChatGPT, you might imagine that we could snapshot test our app in a few languages with different characteristics (e.g., more verbose, right-to-left, etc.) to validate that our UI gracefully handles those cases. However, due to the asynchronous nature of localization, where translated strings may not arrive for multiple days after a new string is introduced, we instead rely on pseudolanguage snapshots to exercise these cases.

Specifically, we use right-to-left English snapshots (via -NSForceRightToLeftWritingDirection) as well as double-length English snapshots (via -NSDoubleLocalizedStrings), which ensures that our app gracefully adapts to both more verbose and right-to-left languages without needing translations for those languages to exist yet.
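As a rough sketch, the two flags named above can be modeled as launch arguments. The `Pseudolanguage` enum and the helper function are hypothetical names for this example; the `"-Key", "YES"` pairing follows the usual convention for passing defaults as launch arguments. How the arguments reach the app (an Xcode scheme, a UI test's launch arguments, or a snapshot harness) is not stated in the interview and is left open here.

```swift
/// The two pseudolanguage modes mentioned in the answer above (hypothetical type).
enum Pseudolanguage: CaseIterable {
    case doubleLength
    case rightToLeft

    /// Launch arguments that switch the mode on, as a "-Key", "Value" pair.
    var launchArguments: [String] {
        switch self {
        case .doubleLength:
            return ["-NSDoubleLocalizedStrings", "YES"]
        case .rightToLeft:
            return ["-NSForceRightToLeftWritingDirection", "YES"]
        }
    }
}

/// Flattens the arguments for every requested mode into one list.
func launchArguments(for modes: [Pseudolanguage]) -> [String] {
    modes.flatMap { $0.launchArguments }
}
```

Keeping each mode as its own case makes it straightforward to run one snapshot pass per pseudolanguage, or to combine both in a single pass.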

Snapshot with double-length English

On Android, there is no way to test double-length English, so we pick languages that usually have longer translations, like German. We also test Chinese and Arabic (RTL) to cover other variations.

German language snapshot

Q7. Emerge's Snapshots are based on converting previews into tests. Why was this approach a good fit for your team?

Emerge’s approach converts previews into snapshots, and doesn’t require us to build complex infrastructure to collect, iterate, and render preview variants.

Furthermore, Emerge's solution is open source (iOS | Android), which allows us to run individual snapshot tests locally in case we need to reproduce issues with flaky tests, unexpected rendering behavior, or need to build custom tooling on top of it in the future.

Q8. Are there specific features of Emerge's Snapshots that you rely on heavily or find particularly useful?

We're very happy with the overall process, as it's made it so easy to review UI differences and approve changes if needed. The web UI for viewing snapshot differences for a given pull request is a great feature that we rely heavily on. Not only is it great for seeing the differences between snapshots, but it does a great job of grouping additions, removals, and modifications by component, as well as gracefully handling things like renames, ignoring common flaky snapshots, etc.

One area that I could imagine the snapshot review experience evolving in the future is having a large language model summarize the snapshot differences, or highlight differences that seem inconsequential vs. those that are more likely to be a regression.

Emerge's status checks have a built-in approval workflow, and changes can also be reviewed and approved from the web UI. You can check out our docs to see how diffs are reviewed and approved.

Approval workflow through the web UI

Q9. How have you seen Snapshots impact your development process (developer velocity, bug detection, collaboration, etc)?

Snapshots have helped improve our confidence when making both large and small changes to our UI. Rather than trying to remember to test every single state of a component, we now have thousands of UI snapshot tests (iOS/macOS) that run on every pull request, ensuring that all of these states are exercised and checked for regressions across all of our platforms and environments.

Since previews are already a core part of our UI development process, we haven't seen any issues with the adoption of snapshot tests in new features or as we add new team members.

Q10. What advice would you give to other teams who are just starting to implement snapshot testing for their mobile applications?

The most important point is to ensure that your snapshots are consistently delivering value to engineers on your team and not slowing them down or making their lives harder. This requires a low but consistent investment:

Ensure your snapshots are not flaky (both on main and in newly added snapshots). Emerge handles many flakes automatically, but you should still be proactive in identifying and fixing flakes.
Be careful with animations, as they can result in flaky tests. On Android, we set the Random seed to 0 for all previews to get consistent results.
Don't exercise duplicate or redundant states.
Make it easy to exercise your component states in environments like dark mode. Ideally, no extra code is required.
Make your architecture conducive to snapshotting; for example, by adding seams between the view layer and the model/controller layer, which allows the view to be snapshotted in isolation without instantiating many dependencies.
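The fixed-seed tip above is described for Android. A comparable idea in Swift might look like the sketch below, which is an assumption rather than anything the team describes: a small SplitMix64 generator conforming to the standard library's `RandomNumberGenerator` protocol, so that any "random" preview content is identical on every run. `SeededGenerator` and `placeholderWidths` are hypothetical names.

```swift
/// Deterministic generator (SplitMix64). The same seed always produces the
/// same sequence, unlike SystemRandomNumberGenerator.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        state = seed
    }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

/// Example: pseudo-random placeholder widths for a skeleton-loading preview.
/// With a fixed seed the layout is stable from one snapshot run to the next.
func placeholderWidths(count: Int, seed: UInt64 = 0) -> [Int] {
    var generator = SeededGenerator(seed: seed)
    return (0..<count).map { _ in
        Int.random(in: 40...200, using: &generator)
    }
}
```

Passing the generator explicitly, rather than relying on a global source of randomness, is what keeps the rendered output reproducible.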

If you'd like to learn more about Emerge's Snapshots, you can check out our docs or sign up for a free trial to try it out.
