Debugging a Flutter crash that only happened in production on iOS.

The crash report came in on a Tuesday morning. iPhone 13, iOS 17.5, Flutter 3.22, our app's payment screen. The stack trace was three frames of libsystem_pthread.dylib, two frames of an obfuscated Dart symbol, and an EXC_BAD_ACCESS (SIGSEGV). The crash had three thousand occurrences in the previous 24 hours. Not in debug. Not in profile. Only in release.

It took me four days to find. It was not what I thought it was. The clue I needed was in a log I did not collect, the bug was in code I did not write, and the symptom in production was nothing like the cause in development. This is the postmortem, told as the story of how I actually worked through it, not how I should have.

Day 1: the misleading stack trace

The Crashlytics report looked like this:

Crashed: com.apple.main-thread
0  libsystem_pthread.dylib    0x000... pthread_kill + 8
1  libsystem_c.dylib          0x000... abort + 168
2  Runner                     0x000... <symbol obfuscated>
3  Runner                     0x000... <symbol obfuscated>
4  Flutter                    0x000... Dart_PropagateError + ...
5  Flutter                    0x000... Dart_HandleMessage + ...

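The two obfuscated frames in Runner were our Dart code. We use --obfuscate --split-debug-info for release builds, so without the symbol map the frames are unreadable. I deobfuscated them with flutter symbolize, passing the saved stack trace as the input (-i) and the matching symbols file from our --split-debug-info directory as the debug info (-d).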

That told me the crash was in our PaymentService.confirmPayment method, on a line that called await _api.post(...). The crash was not in our code; it was in a network call. So I assumed: stale token, server returned malformed JSON, deserialization failure. I spent the rest of Day 1 chasing JSON.

It was not JSON. Every reproduction failed locally. The release build of the same payment screen worked on every test device.

Day 2: the in-app log capture

I added an in-app log capture that wrote the last 200 entries to a local file, included it in the crash bundle Crashlytics uploads, and shipped a hot fix to TestFlight users.
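A minimal sketch of the idea behind that capture: a bounded buffer that keeps only the most recent lines and can write them to a file. It is written in Swift for illustration, the type and method names (RecentLogBuffer, append, snapshot, flush) are hypothetical, and how the file is attached to the crash report is not shown.

```swift
import Foundation

/// Hypothetical bounded log buffer: keeps only the most recent `capacity` lines.
final class RecentLogBuffer {
    private let capacity: Int
    private var entries: [String] = []
    private let lock = NSLock()  // appends may come from several threads

    init(capacity: Int = 200) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Appends one line, dropping the oldest lines once the buffer is full.
    func append(_ line: String) {
        lock.lock()
        defer { lock.unlock() }
        entries.append(line)
        if entries.count > capacity {
            entries.removeFirst(entries.count - capacity)
        }
    }

    /// Returns the buffered lines, oldest first.
    func snapshot() -> [String] {
        lock.lock()
        defer { lock.unlock() }
        return entries
    }

    /// Writes the buffered lines to `url`, one per line, replacing any previous file.
    func flush(to url: URL) throws {
        let text = snapshot().joined(separator: "\n")
        try text.write(to: url, atomically: true, encoding: .utf8)
    }
}
```

A hard crash gives no chance to flush afterwards, so a buffer like this only helps if flush runs after each append or on a short timer.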

Within 24 hours, the dumps came back. The last log line before every crash was the same:

PaymentService.confirmPayment: starting, amount=...

The line that should have followed — PaymentService.confirmPayment: api response received — never appeared. Something inside the network call was crashing the process before any error handler ran.

Day 3: the platform channel

Our HTTP client used dio for most requests and a native URLSession via a platform channel for one specific call: the payment confirmation. The reason was obscure compliance work that needed iOS-specific certificate pinning behaviour. I had not written that channel; a native engineer had, eight months earlier.

I read the Swift side:
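What follows is a minimal sketch of the shape of that handler, not the verbatim source. The channel name, the function name, and the session and buildRequest parameters are hypothetical placeholders; only the three flagged expressions come from the description that follows.

```swift
import Flutter
import Foundation

// Sketch only: the channel name and both helper parameters are hypothetical.
func registerConfirmPaymentChannel(
    messenger: FlutterBinaryMessenger,
    session: URLSession,                                    // assumed to carry the pinning configuration
    buildRequest: @escaping ([String: Any]) -> URLRequest   // hypothetical request builder
) {
    let channel = FlutterMethodChannel(name: "example/payments", binaryMessenger: messenger)
    channel.setMethodCallHandler { (call: FlutterMethodCall, result: @escaping FlutterResult) in
        // Forced cast: crashes if the arguments are not a [String: Any] map.
        let args = call.arguments as! [String: Any]

        session.dataTask(with: buildRequest(args)) { data, _, _ in
            // Forced try and forced unwrap: a nil body or invalid JSON crashes.
            let json = try! JSONSerialization.jsonObject(with: data!)
            // Runs on the session's delegate queue, which is a background
            // queue unless the session was created with the main queue.
            result(json)
        }.resume()
    }
}
```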

There were three problems:

call.arguments as! [String: Any] forces a cast and crashes on type mismatch.
try! JSONSerialization.jsonObject(with: data!) forces a try and forces an unwrap; either failure is a crash.
result(json) is called on the URLSession callback queue, which is not the main thread.

In debug, the JSON always parsed, the arguments were always the expected map, and calling result from the URLSession callback queue happened to work because the engine was lenient about it. In release, with AOT compilation and tighter engine teardown handling, calling result off the main thread sometimes landed after the engine had already deallocated the binary messenger. That was the EXC_BAD_ACCESS.

The clue I missed for two days: the crashes were concentrated on slow networks. On slow networks the URLSession callback fires later, increasing the chance that engine teardown beats the callback. On a developer machine on Wi-Fi the response came back in 80ms and the race window was effectively zero.

Day 4: the fix

I rewrote the Swift handler properly:
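A minimal sketch of a handler with those properties, under the same assumptions as before: the channel name, function name, helper parameters, and error codes are hypothetical, and this is an illustration rather than the code that shipped.

```swift
import Flutter
import Foundation

// Sketch only: channel name, helper parameters, and error codes are hypothetical.
func registerConfirmPaymentChannel(
    messenger: FlutterBinaryMessenger,
    session: URLSession,
    buildRequest: @escaping ([String: Any]) -> URLRequest
) {
    let channel = FlutterMethodChannel(name: "example/payments", binaryMessenger: messenger)
    channel.setMethodCallHandler { (call: FlutterMethodCall, result: @escaping FlutterResult) in
        // Every reply goes through this helper, so result always runs on the main thread.
        func reply(_ value: Any?) {
            DispatchQueue.main.async { result(value) }
        }

        // Conditional cast: a type mismatch becomes an error on the Dart side.
        guard let args = call.arguments as? [String: Any] else {
            reply(FlutterError(code: "bad_args", message: "Expected a map of arguments", details: nil))
            return
        }

        session.dataTask(with: buildRequest(args)) { data, _, error in
            // No forced unwrap: a missing body is reported, not crashed on.
            guard let data = data else {
                reply(FlutterError(code: "network_error", message: error?.localizedDescription, details: nil))
                return
            }
            // do/catch instead of try!: invalid JSON is reported, not crashed on.
            do {
                reply(try JSONSerialization.jsonObject(with: data))
            } catch {
                reply(FlutterError(code: "bad_json", message: error.localizedDescription, details: nil))
            }
        }.resume()
    }
}
```

Passing a FlutterError to result surfaces on the Dart side as a PlatformException, so the await can fail inside a normal try/catch instead of taking the process down.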

Three changes:

as? instead of as!, with a typed error.
try with do/catch, with a typed error.
result invoked on the main thread, always.

Crashes dropped to zero within 24 hours of the release.

How I would have debugged this in a native iOS app

In a pure Swift app, this same bug would have been caught by:

Thread Sanitizer (TSan), one of Xcode's runtime checks, enabled on a release-like build.
Address Sanitizer (ASan), catching the use-after-free access on result after engine teardown.
A careful code review that flagged force-unwrap in production code.

The Flutter equivalent of this is harder. There is no TSan on the Dart side. You have to be disciplined about not crossing thread boundaries unsafely on the platform side, and you have to test release builds, not just debug.

Stack trace flow, illustrated

Caption: the timing race that produced the crash. Engine teardown deallocated the binary messenger before a result call made on a non-main thread landed.

What I would do differently

I would have shipped the in-app log capture from day one. I lost a day flailing without it.
I would have audited every platform channel I did not write before shipping, with a checklist for thread hops, error handling, and forced casts.
I would have set up a "release-mode test build" pipeline that ran the app on a slow network simulator and recorded any crash reports, before any production user saw them.
I would have added a CI rule banning as! and try! in our iOS codebase. Either is a crash waiting for the wrong input.
I would not have assumed the obfuscated stack trace told me the answer. The deobfuscated frame was the call site, not the cause. The cause was on the other side of the channel boundary.
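The CI rule banning as! and try! could start as small as a text scan. The sketch below is a deliberately naive illustration: ForcedOperation and findForcedOperations are hypothetical names, and the scan does not skip comments or string literals, so false positives are possible. SwiftLint's force_cast and force_try rules cover the same ground more robustly.

```swift
import Foundation

/// One forced operation found in a source file.
struct ForcedOperation: Equatable {
    let line: Int      // 1-based line number
    let token: String  // "as!" or "try!"
}

/// Naive scan: reports every line that contains `as!` or `try!` as a whole token.
/// Assumption: comments and string literals are not excluded.
func findForcedOperations(in source: String) -> [ForcedOperation] {
    // \b keeps identifiers that merely end in "as" or "try" from matching.
    let patterns: [(token: String, regex: String)] = [
        ("as!", #"\bas!"#),
        ("try!", #"\btry!"#),
    ]
    var findings: [ForcedOperation] = []
    let lines = source.components(separatedBy: "\n")
    for (index, text) in lines.enumerated() {
        for pattern in patterns
        where text.range(of: pattern.regex, options: .regularExpression) != nil {
            findings.append(ForcedOperation(line: index + 1, token: pattern.token))
        }
    }
    return findings
}
```

A CI step would run this over each Swift file in the iOS target and fail the build when the returned array is not empty.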

Closing opinion

Production-only crashes on iOS are almost always either a release-mode optimization unmasking a latent bug, a thread-safety issue that debug tolerates, or a memory issue that ASan would catch. Build the in-app log capture, deobfuscate every release crash immediately, and audit every platform channel for thread discipline. If you do those three things, you find these bugs in a day, not four. For more on platform channels specifically, see Writing a Flutter platform channel in Swift. For other postmortem-flavoured posts, see Why our Flutter app was 200MB and how we got it down to 38MB.