
GenAI features in a Flutter app. What works, what is broken, and what users actually want.

Postmortem of shipping a GenAI feature in a production Flutter app. Latency UX, streaming text, on-device vs cloud, what flopped, what stuck, and what the stores allow.

The GenAI feature we shipped in late 2025 was supposed to be the headline of the quarter. It was a "smart compose" assistant that helped users write and structure their notes. The internal demo got a standing ovation. Three weeks after launch, daily active usage of the feature was 6%. Three months in, we removed half of it and replaced the rest with something users actually wanted, which was not what we built.

This post is the postmortem. What worked, what did not, what users wanted that we missed, and what the App Store and Play Store actually allow you to do with on-device versus cloud models. If you are about to ship a GenAI feature in a Flutter app, read this first. The cheap lessons are in here. The expensive ones are too.

Context: what "GenAI feature" usually means

In 2026, when a product manager says "let's add GenAI," they usually mean one of:

  1. A chat interface bolted onto the app.
  2. An auto-complete or suggestion overlay.
  3. A "summarize this" or "explain this" button.
  4. A generative image or audio feature.
  5. An agentic flow that takes actions on the user's behalf.

These have wildly different UX requirements, technical risks, and store policy implications. The mistake we made was treating "GenAI" as one thing. It is five things, and the answer to "should we build it" is different for each.

What worked: the mundane wins

The features that earned their keep were the boring ones.

  • Smart paste: when the user pastes long text into a notes field, an inline button offers to clean it up, fix formatting, and add a heading. One tap, no chat, no preamble. Used in 34% of paste actions in the first month.
  • Title suggestion: after the user has typed a few sentences, a small chip appears suggesting a title. They tap it or ignore it. Used 18% of the time, dismissed silently the rest.
  • Search-by-meaning: an embedding-based search over the user's notes that worked alongside the existing keyword search. The user did not know it was AI; they just noticed search got better.

The pattern: the AI is a feature, not a destination. The user does not "go to" the AI. The AI shows up at the right moment with a single, low-commitment action.

What flopped: the obvious thing

The chat interface flopped. It was beautifully designed, streamed responses smoothly, handled errors well, and had a rich prompt template library. Users used it twice and never came back.

When we did interviews, the reason was consistent: "I do not want to talk to my notes app. I want it to do the thing." Users did not want to learn what to ask. They wanted the app to act when there was something obvious to do.

The chat interface is the easiest thing to build and the hardest thing to make useful. It is the GenAI equivalent of a Settings screen with 200 toggles.

Latency UX: the thing we got right by accident

Streaming text display was the single biggest UX improvement. A response that takes 4 seconds to fully arrive feels instant if the first character appears in 200ms and the rest streams smoothly. Compare that to a 1.5-second response that arrives all at once: it is technically faster but feels slower.

Two implementation details that matter:

  • Stream by token, not by chunk. The provider may give you bigger chunks; split them so the UI updates feel smooth rather than appearing in bursts (see the sketch after this list).
  • Animate the cursor. A blinking caret at the streaming position tells users "this is still working" without a spinner that screams "you are waiting."
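
To make the first point concrete, here is a minimal Dart sketch of re-chunking a provider stream so the UI updates at a steady pace. It assumes you already have a Stream<String> of raw chunks from your LLM client; the function name and pacing value are illustrative, not from our production code.

```dart
/// Re-emit provider chunks word by word so streamed text animates smoothly
/// instead of appearing in bursts. Pace and split strategy are tunable.
Stream<String> smoothTokens(
  Stream<String> chunks, {
  Duration pace = const Duration(milliseconds: 15),
}) async* {
  await for (final chunk in chunks) {
    final words = chunk.split(' ');
    for (var i = 0; i < words.length; i++) {
      // Re-attach the space we split on, except after the last piece.
      yield i < words.length - 1 ? '${words[i]} ' : words[i];
      await Future<void>.delayed(pace);
    }
  }
}
```

Feed the result into whatever accumulates the visible text (a StreamBuilder appending to a buffer works); the animated caret sits on top of that.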

The full architecture for streaming, including the edge proxy, is in Integrating an LLM into a Flutter app.

On-device vs cloud: the tradeoffs

In early 2026 the realistic options for on-device inference in a Flutter app are:

  • Apple Intelligence (iOS 18.1+, only on supported hardware; access through specific system APIs from Swift, not Dart; see the platform-channel sketch after this list).
  • Google AICore / Gemini Nano on supported Android devices.
  • A bundled small model (TFLite, MLC-LLM, or similar) running through tflite_flutter or a custom plugin.
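
Neither platform exposes its on-device model to Dart directly, so access goes through a platform channel. Here is a minimal sketch of the Dart side, with a hypothetical channel and method name; the Swift or Kotlin side would forward the prompt to Apple Intelligence or AICore respectively.

```dart
import 'package:flutter/services.dart';

/// Dart side of a hypothetical platform channel for on-device inference.
/// The channel and method names are placeholders; the native side calls
/// the platform's model API and returns the completion as a string.
class OnDeviceModel {
  static const _channel = MethodChannel('app.example/on_device_llm');

  Future<String?> complete(String prompt) async {
    try {
      return await _channel
          .invokeMethod<String>('complete', {'prompt': prompt});
    } on PlatformException {
      return null; // unsupported hardware or OS: caller falls back to cloud
    } on MissingPluginException {
      return null; // native side not implemented on this platform
    }
  }
}
```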

The honest comparison:

[Comparison table: on-device vs cloud inference. Neither is universally correct; the right choice depends on the feature.]

What we ended up doing:

  • Auto-complete and quick suggestions: bundled small model on-device. The latency budget was 50ms; cloud could not meet it.
  • Long-form generation, summarization, and search: cloud. Quality difference at this length is large enough to be visible.
  • Embeddings for search: precomputed on-device using a small embedding model, then queried locally (see the sketch below). No cloud cost, no privacy issue.
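
The local query side is small enough to show. A sketch of cosine-similarity ranking over precomputed note vectors; the embedding model and storage are out of scope, and the names are illustrative.

```dart
import 'dart:math';

/// Rank notes by cosine similarity between a query embedding and
/// precomputed note embeddings. Everything stays on-device.
double cosine(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (sqrt(normA) * sqrt(normB));
}

List<String> topMatches(
  Map<String, List<double>> noteVectors, // noteId -> embedding
  List<double> queryVector, {
  int k = 10,
}) {
  final scored = noteVectors.entries
      .map((e) => MapEntry(e.key, cosine(e.value, queryVector)))
      .toList()
    ..sort((a, b) => b.value.compareTo(a.value));
  return [for (final e in scored.take(k)) e.key];
}
```

Run this alongside the existing keyword search and merge the results; the user never needs to know which engine produced the hit.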

What the stores actually allow

Both stores have policies that affect GenAI features. The summary, accurate as of early 2026:

  • App Store: Generative AI features must include reporting mechanisms for objectionable content, content filters appropriate to the audience, and clear disclosure when content is AI-generated. Apps that generate images or text from user prompts are subject to additional scrutiny. Apple Intelligence integrations must follow specific entitlements and disclosure requirements.
  • Play Store: Generative AI apps and features must implement safe-experience guidelines, including content filtering and an in-app way to flag offensive content. Apps targeting children have stricter requirements.

If you build a feature that generates content from user prompts, plan for:

  • An in-app report button on every generated artifact (see the widget sketch after this list).
  • A content moderation pipeline (provider-side, edge-side, or both).
  • A clear disclosure that the content is AI-generated.
  • A privacy policy update describing what is sent to which provider.
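
What the first and third items look like in practice, as a minimal Flutter sketch; widget and callback names are illustrative, and the report action would open whatever flow feeds your moderation pipeline.

```dart
import 'package:flutter/material.dart';

/// Wraps a generated result with the two affordances reviewers look for:
/// an "AI-generated" disclosure and a per-artifact report action.
class GeneratedResultCard extends StatelessWidget {
  const GeneratedResultCard({
    super.key,
    required this.text,
    required this.onReport,
  });

  final String text;
  final VoidCallback onReport;

  @override
  Widget build(BuildContext context) {
    return Card(
      child: Column(
        crossAxisAlignment: CrossAxisAlignment.start,
        children: [
          Padding(padding: const EdgeInsets.all(12), child: Text(text)),
          Row(
            children: [
              const Padding(
                padding: EdgeInsets.only(left: 12),
                child: Text('AI-generated', style: TextStyle(fontSize: 12)),
              ),
              const Spacer(),
              TextButton(onPressed: onReport, child: const Text('Report')),
            ],
          ),
        ],
      ),
    );
  }
}
```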

Skipping any of these will get the app rejected at review. We learned this the slow way; the third submission was the one that passed.

The cost economics, real numbers

For our app at 100k DAU with the cloud features:

  • Cost per active user per day: roughly $0.02 with gpt-4o-mini-tier models and aggressive truncation.
  • Cost per active user per day: roughly $0.18 if we used the highest-tier model. Quality difference for our use case did not justify a 9x cost.
  • On-device features: zero per-request cost, $0 marginal. The tradeoff was a 180MB increase in install size for the bundled model.

A per-user budget at the edge (a hard cap of $X per user per day) is required. Without it, a single user with a script can rack up hundreds of dollars in a day. We cap and rate-limit at the edge, the same edge that fronts every LLM call from the app.
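
The enforcement logic itself is small. A sketch of the check the edge runs before forwarding a request to the provider; in production the counters live in a shared store rather than an in-memory map, and every name here is illustrative.

```dart
/// Per-user daily spend cap, checked before every proxied LLM call.
/// In-memory state is for illustration; use a shared KV store in production.
class UserBudget {
  UserBudget({required this.dailyCapUsd});

  final double dailyCapUsd;
  final Map<String, double> _spentToday = {};

  /// Returns false when the request would exceed the cap; the edge then
  /// returns 429 instead of calling the provider.
  bool tryCharge(String userId, double estimatedCostUsd) {
    final spent = _spentToday[userId] ?? 0.0;
    if (spent + estimatedCostUsd > dailyCapUsd) return false;
    _spentToday[userId] = spent + estimatedCostUsd;
    return true;
  }

  /// Run from a daily scheduled job.
  void resetDaily() => _spentToday.clear();
}
```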

What I would do differently

  • I would not have shipped the chat interface. I would have shipped only the "smart paste" and "title suggestion" features and listened for what users asked for next.
  • I would have started with on-device for everything that could plausibly fit. The latency win and the zero marginal cost are powerful, and you can fall back to cloud for the cases that need it.
  • I would have built the moderation pipeline before the feature, not after the App Store rejection. It cost us two weeks of release delays.
  • I would have hidden the AI label on the boring wins. Users did not want to know "this is AI." They just wanted search to work better.
  • I would have built a kill switch on the edge from day one. When gpt-4o-mini had a 30-minute incident, we wanted to fall back to a cached or simpler experience instantly. Without a kill switch, we just showed errors. (A minimal sketch follows this list.)
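
A sketch of the client-visible half of such a kill switch, assuming the edge exposes a flags endpoint; the URL, flag name, and the http package usage are illustrative.

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

/// Check a remotely controlled flag before any LLM call. If the flag is
/// off, or the check itself fails, the app serves the cached or
/// rule-based experience instead of showing an error.
Future<bool> smartComposeEnabled() async {
  try {
    final res = await http
        .get(Uri.parse('https://edge.example.com/flags'))
        .timeout(const Duration(seconds: 2));
    final flags = jsonDecode(res.body) as Map<String, dynamic>;
    return flags['smart_compose_enabled'] == true;
  } catch (_) {
    return false; // degrade gracefully: no LLM call, no error screen
  }
}
```

Cache the flag client-side for a minute or two so the check does not add a round trip to every call.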

What users actually wanted

In retrospect, the user research told us this and we did not listen:

  • Users wanted things to be done for them, not tools to do things with.
  • Users wanted the AI to be invisible when it was right and obvious when it was wrong (so they could correct it).
  • Users wanted control over when the AI activated, especially on personal data.
  • Users wanted the result, not the conversation.

Every flop was a feature that asked the user to engage with the AI as a thing. Every win was a feature where the AI was a quiet improvement to something they were already doing.

Closing opinion

Build inline, low-commitment, AI-assisted improvements to actions users already take. Do not build chat interfaces unless your product is genuinely a chat product. Use on-device models where latency or privacy matters; use cloud where quality matters. Cap costs at the edge with a per-user budget. Plan for moderation before you plan for features. If you do those things, your GenAI feature will earn its keep instead of becoming a quarterly demo nobody uses. For the technical foundation, see Integrating an LLM into a Flutter app. For the dev-side companion, see Building an AI-powered code review assistant.

Written by the author of Flutterstacks

A developer who shipped production apps in Swift, Kotlin, and Dart — with a genuine native reference point that most Flutter writers simply don't have.


Continue reading


Integrating an LLM into a Flutter app without it feeling like a science project.

A real architecture for calling an LLM API from a Flutter app. Streaming, error handling, loading states, and cost-aware design. With a sequence diagram of the full request lifecycle.


Building an AI-powered code review assistant inside your Flutter dev workflow.

How to wire an LLM into a Flutter team's code review pipeline so it catches Dart-specific issues. Real prompts, a CI flowchart, and what does and does not work.
