AI and Flutter · 8 min read

Integrating an LLM into a Flutter app without it feeling like a science project.

A real architecture for calling an LLM API from a Flutter app. Streaming, error handling, loading states, and cost-aware design. With a sequence diagram of the full request lifecycle.

The first LLM integration I shipped in a Flutter app worked beautifully in the demo and embarrassingly in production. The demo had one user, a strong network, and predictable prompts. Production had thousands of users, half of them on transit Wi-Fi, and a billing dashboard that hit a thousand dollars a day inside the first week. The feature was good. The architecture was naive.

This post is the architecture I would build today. Not the demo version. The version that handles streaming responses, error states a user can act on, network conditions that break things, and a cost model that does not surprise the CFO. I will use OpenAI as the example provider because the API shape is well-known, but the same patterns apply to Gemini, Anthropic's Claude API, and most other providers.

Context: what is actually different about LLM calls

Compared to a typical REST call, an LLM API call is:

  • Slower (hundreds of milliseconds to many seconds, depending on output length).
  • Streaming-first if you want a good UX (Server-Sent Events or chunked transfer).
  • Stateful in cost (every token in and out is billed).
  • Variable in shape (the same endpoint can return one sentence or an essay, and you control the latency by controlling the output length).
  • Sensitive to context size (each call may carry the whole conversation history).

A native developer's intuition that "an HTTP call is an HTTP call" is wrong here. The right intuition is closer to streaming video: progressive rendering matters, mid-stream errors are real, and bandwidth costs are nontrivial.

The architecture, top down

[ Flutter UI ]
    │  Riverpod notifier with AsyncNotifier
    │
[ ChatRepository ]      <-- where the API contract is owned
    │  Stream<ChatChunk> sendMessage(...)
    │
[ LlmClient ]           <-- thin HTTP layer, knows about SSE
    │  http.Client / dio
    │
[ Edge / your backend ] <-- DO NOT call OpenAI directly from the app
    │
[ LLM provider ]

Three things to notice:

  1. The app does not call the LLM provider directly. It calls your backend, which calls the provider. This is non-negotiable.
  2. The repository returns a Stream<ChatChunk>, not a Future<String>. Streaming is the right shape for the UI.
  3. The Riverpod notifier owns the state machine: idle, requesting, streaming, errored, complete.

Why the app must not call the provider directly

Putting your OpenAI API key in the Flutter app is the same as publishing it. Mobile apps can be reverse-engineered, and string literals end up in the compiled binary. Even with obfuscation, the key is extractable. Beyond the security issue:

  • You cannot rotate keys without an app update.
  • You cannot rate-limit per user without auth on the device.
  • You cannot enforce content moderation centrally.
  • You cannot cache responses or fan out to a cheaper model.
  • You cannot measure cost per user.

Your backend gets a thin proxy that takes your authenticated user, applies a rate limit, optionally rewrites or filters the prompt, calls the provider, and streams the response back. Cloudflare Workers, a small Node service, a Fly.io app, or your existing API can all do this in a few hundred lines.
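To make the shape concrete, here is a rough sketch of that proxy written in Dart with the shelf package (the same idea applies to Workers, Node, or whatever you already run). The route, the header names, and the 512-token ceiling are placeholder choices, not a recommendation:

// Sketch of an edge proxy in Dart/shelf. Auth, rate limiting, and logging
// are stubbed; the route and header names are assumptions for illustration.
import 'dart:convert';
import 'dart:io';

import 'package:http/http.dart' as http;
import 'package:shelf/shelf.dart';
import 'package:shelf/shelf_io.dart' as shelf_io;

// Example provider endpoint; swap in whichever provider you use.
final _providerUrl = Uri.parse('https://api.openai.com/v1/chat/completions');

Future<Response> chatProxy(Request request) async {
  // 1. Authenticate your own user (placeholder check).
  final userToken = request.headers['x-user-token'];
  if (userToken == null) return Response(401, body: 'Sign in required');

  // 2. A per-user rate limit and daily budget check would go here.

  // 3. Rewrite the body: clamp max_tokens so a client can never raise it.
  final body = jsonDecode(await request.readAsString()) as Map<String, dynamic>;
  body['max_tokens'] = 512;
  body['stream'] = true;

  // 4. Call the provider with the server-side key and stream the bytes back.
  final upstream = http.Request('POST', _providerUrl)
    ..headers['Authorization'] = 'Bearer ${Platform.environment['OPENAI_API_KEY']}'
    ..headers['Content-Type'] = 'application/json'
    ..body = jsonEncode(body);
  final providerResponse = await http.Client().send(upstream);

  return Response(
    providerResponse.statusCode,
    body: providerResponse.stream,
    headers: {'Content-Type': 'text/event-stream'},
  );
}

Future<void> main() async {
  await shelf_io.serve(chatProxy, '0.0.0.0', 8080);
}

The part that matters is step 3: the app never gets to choose max_tokens or the model; the edge does.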

The streaming UX in Dart

Server-Sent Events from your edge to the app, parsed into a Stream<ChatChunk>:
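Here is roughly what that client layer can look like. Treat it as a sketch: ChatChunk, the /chat path, and the {"delta": ...} payload shape are placeholders for whatever your edge actually emits.

import 'dart:convert';

import 'package:http/http.dart' as http;

/// One streamed piece of the assistant's reply.
class ChatChunk {
  const ChatChunk({required this.text, this.done = false});
  final String text;
  final bool done;
}

class LlmClient {
  LlmClient(this._client, this._baseUrl);

  final http.Client _client;
  final Uri _baseUrl;

  Stream<ChatChunk> streamChat(String prompt) async* {
    final request = http.Request('POST', _baseUrl.resolve('/chat'))
      ..headers['Content-Type'] = 'application/json'
      ..body = jsonEncode({'prompt': prompt});

    final response = await _client.send(request);
    if (response.statusCode != 200) {
      throw http.ClientException('Chat request failed: ${response.statusCode}');
    }

    // SSE frames arrive as lines shaped like: data: {"delta": "Hel"}
    final lines =
        response.stream.transform(utf8.decoder).transform(const LineSplitter());

    await for (final line in lines) {
      if (!line.startsWith('data: ')) continue;
      final payload = line.substring(6).trim();
      if (payload == '[DONE]') {
        yield const ChatChunk(text: '', done: true);
        return;
      }
      final json = jsonDecode(payload) as Map<String, dynamic>;
      yield ChatChunk(text: json['delta'] as String? ?? '');
    }
  }
}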

The Riverpod notifier:
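A sketch of the notifier. The state names follow the state machine above (idle, requesting, streaming, errored, complete). I am showing a plain Notifier rather than AsyncNotifier here, because a hand-rolled sealed state class maps more directly onto a streaming lifecycle; the repository provider and class names are placeholders, and ChatChunk comes from the client sketch above.

import 'dart:async';

import 'package:flutter_riverpod/flutter_riverpod.dart';

// Assumed repository contract from the architecture diagram above.
abstract class ChatRepository {
  Stream<ChatChunk> sendMessage(String prompt);
}

final chatRepositoryProvider = Provider<ChatRepository>(
  (ref) => throw UnimplementedError('Wire up a concrete ChatRepository here'),
);

// The state machine the notifier owns.
sealed class ChatState {
  const ChatState();
}

class ChatIdle extends ChatState { const ChatIdle(); }

class ChatRequesting extends ChatState { const ChatRequesting(); }

class ChatStreaming extends ChatState {
  const ChatStreaming(this.textSoFar);
  final String textSoFar;
}

class ChatComplete extends ChatState {
  const ChatComplete(this.text);
  final String text;
}

class ChatErrored extends ChatState {
  const ChatErrored(this.error);
  final Object error;
}

class ChatController extends Notifier<ChatState> {
  StreamSubscription<ChatChunk>? _subscription;

  @override
  ChatState build() {
    ref.onDispose(() => _subscription?.cancel());
    return const ChatIdle();
  }

  Future<void> send(String prompt) async {
    await _subscription?.cancel();
    state = const ChatRequesting();
    final buffer = StringBuffer();
    _subscription = ref.read(chatRepositoryProvider).sendMessage(prompt).listen(
      (chunk) {
        buffer.write(chunk.text);
        state = ChatStreaming(buffer.toString());
      },
      onDone: () => state = ChatComplete(buffer.toString()),
      onError: (Object error, StackTrace _) => state = ChatErrored(error),
    );
  }

  // The cancel button this post insists on: stop the stream, keep the partial text.
  Future<void> cancel() async {
    await _subscription?.cancel();
    if (state case ChatStreaming(:final textSoFar)) {
      state = ChatComplete(textSoFar);
    } else {
      state = const ChatIdle();
    }
  }
}

final chatControllerProvider =
    NotifierProvider<ChatController, ChatState>(ChatController.new);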

The widget:
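And the widget, again a sketch, assuming the ChatController above:

import 'package:flutter/material.dart';
import 'package:flutter_riverpod/flutter_riverpod.dart';

class ChatView extends ConsumerWidget {
  const ChatView({super.key});

  @override
  Widget build(BuildContext context, WidgetRef ref) {
    final state = ref.watch(chatControllerProvider);
    return Column(
      crossAxisAlignment: CrossAxisAlignment.stretch,
      children: [
        switch (state) {
          ChatIdle() => const Text('Ask me anything'),
          ChatRequesting() => const LinearProgressIndicator(),
          // Render the partial text as it streams in; this is the whole point
          // of returning a Stream instead of a Future.
          ChatStreaming(:final textSoFar) => Text(textSoFar),
          ChatComplete(:final text) => Text(text),
          ChatErrored(:final error) => Text('Error: $error'),
        },
        if (state is ChatStreaming)
          TextButton(
            onPressed: () => ref.read(chatControllerProvider.notifier).cancel(),
            child: const Text('Stop'),
          ),
      ],
    );
  }
}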

The full request lifecycle

[Sequence diagram] Every layer of the request, including the edge that protects your API key. Each arrow is a real boundary that can fail and must be handled.

Threading: a native developer's instinct

A native developer's first instinct is to do this work on a background thread. In Flutter, the HTTP and JSON parsing is fine on the UI isolate; the work is I/O-bound and small. The exception is if you receive very large chunks and parse complex JSON per token, which can stall a frame. In that case, parse on a background isolate using compute or a long-lived isolate worker.
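If profiling shows the per-chunk decode really is the problem, the compute version is small. This is a sketch; decodeChunkOffUiIsolate is a made-up name:

import 'dart:convert';

import 'package:flutter/foundation.dart';

// Must be a top-level (or static) function so it can be sent to another isolate.
Map<String, dynamic> _decodeChunk(String payload) =>
    jsonDecode(payload) as Map<String, dynamic>;

// Runs the JSON decode on a background isolate and returns the result.
Future<Map<String, dynamic>> decodeChunkOffUiIsolate(String payload) =>
    compute(_decodeChunk, payload);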

The Swift equivalent of streaming would use URLSession.bytes(for:) and an AsyncSequence. The Kotlin equivalent would use OkHttp with a streaming response and kotlinx.coroutines.flow. The shape is the same.

Cost-aware design

Three rules I now follow:

  1. Set a max_tokens ceiling on every request from the edge. Users cannot ask the model to produce a million tokens by accident or on purpose.
  2. Truncate conversation history. Either use a sliding window (last N turns, sketched after this list) or summarize older turns into a single context message.
  3. Default to the cheapest model that meets the quality bar, and only escalate when you have evidence you need it.
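The sliding window in rule 2 is only a few lines of Dart. ChatTurn and the window of 8 turns are illustrative placeholders:

class ChatTurn {
  const ChatTurn({required this.role, required this.text});
  final String role; // 'system', 'user', or 'assistant'
  final String text;
}

// Keep any system prompt, plus only the last [maxTurns] turns of conversation.
List<ChatTurn> slidingWindow(List<ChatTurn> history, {int maxTurns = 8}) {
  final system = history.where((t) => t.role == 'system').toList();
  final rest = history.where((t) => t.role != 'system').toList();
  final tail =
      rest.length <= maxTurns ? rest : rest.sublist(rest.length - maxTurns);
  return [...system, ...tail];
}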

Errors users can act on

A single request can fail in several distinct ways: rate limit, content filter rejection, network timeout, invalid prompt, server overload. Treat each as a typed error, not a generic message:
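A sketch of what "typed" means here; the class names are mine, the failure categories are the ones above:

sealed class ChatFailure implements Exception {
  const ChatFailure();
}

class RateLimited extends ChatFailure {
  const RateLimited({this.retryAfter});
  final Duration? retryAfter; // UI: "try again in a moment"
}

class ContentFiltered extends ChatFailure { const ContentFiltered(); }       // UI: explain what was rejected
class NetworkTimeout extends ChatFailure { const NetworkTimeout(); }         // UI: offer retry
class InvalidPrompt extends ChatFailure { const InvalidPrompt(); }           // UI: ask the user to rephrase
class ProviderOverloaded extends ChatFailure { const ProviderOverloaded(); } // UI: "the assistant is busy"
class Unauthenticated extends ChatFailure { const Unauthenticated(); }       // UI: force re-login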

Each maps to a distinct UI state. Rate limit shows "try again in a moment." Overload shows "the assistant is busy." Auth failure forces re-login. Generic errors offer retry. Users hate "something went wrong"; they tolerate specific advice.

What I would do differently

  • I would have built the edge proxy on day one, not "after the prototype." A build with the API key embedded in the app almost shipped to TestFlight once.
  • I would have set max_tokens and a per-user daily budget on the edge from day one. The bill in week one was the most expensive lesson of my career.
  • I would have used a Stream interface from the first prototype, not a Future<String>. Migrating from non-streaming to streaming after the UI was built doubled the work.
  • I would have logged token counts and latency per request to my own metrics from day one. Without those, you cannot reason about cost or UX.
  • I would have built the cancel button before the send button. Long-running streams that cannot be cancelled are a UX failure I shipped twice.

Closing opinion

Build the edge first, the streaming next, the cost ceiling third, the model choice last. If you do those in that order, the integration ships smoothly. If you do them in reverse, you ship fast and pay for it. For the AI workflow side of the same world, see Building an AI-powered code review assistant inside your Flutter dev workflow. For the postmortem of shipping a GenAI feature, see GenAI features in a Flutter app.

