Running Gemma 2B on iOS: Reducing Metal Shader Startup From 9.7s to 1.2s
The first time I loaded a Gemma 2B GGUF model inside an iOS build, I blamed the wrong part of the stack.
I expected quantization to be the hard part. I expected memory pressure to be the hard part. I expected the model file, tokenizer, and context window to fight me. All of those mattered, but none of them caused the most visible failure in the product.
The real failure was simpler: the first model load made the app look frozen.
On an iPhone 12 test device, a naive llama.cpp + Metal initialization path blocked for roughly 9.7 seconds before the first usable inference. On an iPhone 14 Pro, the same path was much more tolerable, but still too visible for a consumer app. On a recent Pro device it looked like an engineering inconvenience. On older hardware it looked like a broken product.
That distinction is where on-device AI engineering becomes real. A demo is allowed to ask the user to wait. A consumer app is not.
The failure mode: shader compilation, not model reading
The naive initialization looked innocent:
var params = llama_context_default_params()
params.n_ctx = 2048
// n_gpu_layers was set earlier, on llama_model_params, when the model file
// was loaded; it is not a llama_context_params field.
let context = llama_new_context_with_model(model, params)
The model file had already been memory-mapped. The tokenizer was not the bottleneck. The expensive step was the creation of the Metal execution path. In Instruments, the startup spike clustered around Metal library and pipeline construction, especially the path where the backend had to compile or specialize shader code before the first real run.
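To make that spike attributable rather than anecdotal, I bracket context creation with a named signpost interval so it shows up directly in the Instruments timeline. A minimal sketch using OSSignposter; the subsystem, category, and wrapper function names are illustrative:

import os

let signposter = OSSignposter(subsystem: "com.example.app", category: "LocalModel")

func createContextSignposted(model: OpaquePointer, params: llama_context_params) -> OpaquePointer? {
    // Appears in the os_signpost instrument as a "llama_context_create" interval.
    let state = signposter.beginInterval("llama_context_create")
    defer { signposter.endInterval("llama_context_create", state) }
    return llama_new_context_with_model(model, params)
}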
The exact timing depends on the model, quantization, llama.cpp version, iOS version, and device. My working measurement looked like this:
| Device | Model | Quantization | First context load | Warm context load | Notes |
|---|---|---|---|---|---|
| iPhone 12 | Gemma 2B | Q4 class | 9.7s | 1.8s-2.4s | worst visible startup |
| iPhone 14 Pro | Gemma 2B | Q4 class | 4.1s | 1.2s-1.6s | acceptable only if hidden |
| Recent Pro device | Gemma 2B | Q4 class | 2.0s-2.8s | under 1.2s | demo-friendly, still not free |
These numbers are not universal benchmarks. They are the kind of device-lab numbers I actually care about when shipping a consumer iOS app: "Will the user think the app is dead?"
The architecture I ended up using
I stopped treating model load as a view concern. The model lifecycle became its own subsystem with four states:
enum LocalModelState: Equatable {
case unavailable(reason: String)
case cold
case warming(progress: Double?)
case ready
case failed(message: String)
}
The SwiftUI layer only observes state. It never owns llama_context, never decides Metal parameters, and never blocks on model construction.
@MainActor
final class ModelStore: ObservableObject {
@Published private(set) var state: LocalModelState = .cold
private let runner = LocalModelRunner()
func prewarmIfNeeded() {
guard state == .cold else { return }
state = .warming(progress: nil)
Task {
do {
try await runner.prepare()
state = .ready
} catch {
state = .failed(message: error.localizedDescription)
}
}
}
}
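For completeness, the observation side is deliberately boring. Here is a sketch of a view that keeps browsing usable while the model warms; the view name and placeholder content are illustrative, and the store is injected once at the app root rather than created per view:

import SwiftUI

struct SessionHomeView: View {
    @EnvironmentObject private var modelStore: ModelStore  // injected once at the root

    var body: some View {
        List { Text("Recent sessions") }            // placeholder for the real history UI
            .task { modelStore.prewarmIfNeeded() }  // fires after the first render
            .overlay(alignment: .bottom) {
                if case .warming = modelStore.state {
                    ProgressView("Preparing on-device model…")
                }
            }
    }
}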
The heavy work lives behind an actor:
actor LocalModelRunner {
private var context: OpaquePointer?
func prepare() async throws {
if context != nil { return }
context = try await Task.detached(priority: .userInitiated) {
// n_gpu_layers belongs to llama_model_params, not llama_context_params;
// ModelLoader applies it when it calls llama_load_model_from_file.
let model = try ModelLoader.loadBundledGGUF(named: "gemma-2b-q4")
var params = llama_context_default_params()
params.n_ctx = 2048
params.flash_attn = true
guard let ctx = llama_new_context_with_model(model.pointer, params) else {
throw LocalModelError.contextCreationFailed
}
return ctx
}.value
}
}
The important part is not just Task.detached. The important part is ownership. A local model context is a scarce, stateful resource. If multiple SwiftUI screens can create it independently, the app eventually gets duplicate loads, memory spikes, cancellation bugs, and inconsistent shader cache behavior.
Why "just do it in the background" is incomplete
Moving initialization off the main thread prevents UI lockup, but it does not make the model usable faster. It only changes how the failure feels. The first serious product decision is where to spend the unavoidable cold-start cost.
For my social training flow, users did not need the model on the first static screen. They needed it when they started an analysis session. That gave me a prewarm window:
- App launches.
- User sees recent sessions and local history.
- The app starts model prewarm after the first frame is stable.
- The model is ready before the user reaches the first heavy interaction.
I do not prewarm before the UI paints. I also do not prewarm blindly after every launch. The trigger uses simple product heuristics:
struct PrewarmPolicy {
let hasRecentSessions: Bool
let batteryLevel: Float
let isLowPowerModeEnabled: Bool
let thermalState: ProcessInfo.ThermalState
var shouldPrewarm: Bool {
hasRecentSessions &&
batteryLevel > 0.20 &&
!isLowPowerModeEnabled &&
thermalState != .serious &&
thermalState != .critical
}
}
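Feeding the policy is straightforward; the one trap is that UIDevice reports a battery level of -1.0 until monitoring is enabled. A minimal sketch, assuming the caller decides hasRecentSessions:

import UIKit

func currentPrewarmPolicy(hasRecentSessions: Bool) -> PrewarmPolicy {
    // batteryLevel reads -1.0 unless monitoring is switched on first,
    // which would silently fail the policy's > 0.20 check.
    UIDevice.current.isBatteryMonitoringEnabled = true
    return PrewarmPolicy(
        hasRecentSessions: hasRecentSessions,
        batteryLevel: UIDevice.current.batteryLevel,
        isLowPowerModeEnabled: ProcessInfo.processInfo.isLowPowerModeEnabled,
        thermalState: ProcessInfo.processInfo.thermalState
    )
}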
That policy matters. On-device AI is not only latency engineering. It is also being polite to the device.
Stable Metal parameters are part of the cache contract
The second source of avoidable startup cost was accidental parameter drift. If context parameters, model file, backend flags, or shader specialization inputs change between runs, the platform may not reuse the same compiled work. Some cache behavior is controlled by the system, but my job is to avoid invalidating it casually.
I now treat model runtime configuration as a versioned artifact:
struct RuntimeProfile: Codable, Hashable {
let modelName: String
let modelSHA256: String
let quantization: String
let contextLength: Int
let gpuLayerCount: Int
let backend: String
let appModelRuntimeVersion: Int
}
This profile is logged on every run. When startup time regresses, I can see whether the app changed a runtime parameter or whether the device had a real cold path.
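The cheapest use of the profile is drift detection at launch: compare the current profile against the last persisted one. A minimal sketch, assuming UserDefaults persistence under a hypothetical key:

import Foundation

func runtimeProfileDidDrift(_ current: RuntimeProfile) -> Bool {
    let key = "lastRuntimeProfile"   // hypothetical storage key
    let defaults = UserDefaults.standard
    defer {
        if let data = try? JSONEncoder().encode(current) {
            defaults.set(data, forKey: key)   // always persist the latest profile
        }
    }
    guard let data = defaults.data(forKey: key),
          let previous = try? JSONDecoder().decode(RuntimeProfile.self, from: data)
    else { return false }            // first launch: nothing to compare against
    return previous != current       // any drift can mean a cold Metal path
}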
I also keep a small local startup sample:
struct ModelStartupSample: Codable {
let profile: RuntimeProfile
let deviceModel: String
let osVersion: String
let coldStartMS: Int
let warmStartMS: Int?
let createdAt: Date
}
For a shipped product, these numbers matter more than a single benchmark screenshot. They tell me whether users on older devices are paying the cost again and again.
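Capturing a sample is a thin wrapper around prepare(). A sketch using ContinuousClock; note that UIDevice.model only yields a coarse string like "iPhone", so a production build would read the exact hardware identifier via utsname:

import UIKit

@MainActor
func recordColdStart(runner: LocalModelRunner, profile: RuntimeProfile) async throws -> ModelStartupSample {
    let clock = ContinuousClock()
    let start = clock.now
    try await runner.prepare()
    let coldMS = Int((clock.now - start) / .milliseconds(1))
    return ModelStartupSample(
        profile: profile,
        deviceModel: UIDevice.current.model,   // coarse; utsname gives "iPhone13,2" etc.
        osVersion: UIDevice.current.systemVersion,
        coldStartMS: coldMS,
        warmStartMS: nil,                      // filled by a separate warm-reload measurement
        createdAt: Date()
    )
}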
The failed approach: initialize at first tap
My first implementation initialized the model after the user tapped the main action. That is the worst possible moment. It binds the largest latency spike to the user's highest-intent action.
The user thinks:
"I tapped the button and the app stopped."
The better design is:
"I opened the app, browsed naturally, and the heavy system became available before I needed it."
This is why I consider model lifecycle a product architecture problem, not only a low-level Metal problem. A technically identical 4-second compile can be acceptable or unacceptable depending on where it happens in the user journey.
What I would optimize next
The remaining hard problem is memory. A warm model is not free. Keeping a 2B-class model ready can pressure other parts of the app, especially if the product also uses photo import, speech recognition, or SwiftData queries.
My current rule is conservative:
| Situation | Model policy |
|---|---|
| Active analysis session | keep warm |
| User browsing recent results | prewarm if device is healthy |
| Low Power Mode | avoid prewarm |
| Serious thermal state | unload or delay |
| Memory warning | release context and keep resumable UI |
The local model should feel like a native capability, but it should not behave like it owns the phone.
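The last table row is the one worth wiring up explicitly. A minimal sketch, assuming a hypothetical unload() on LocalModelRunner, and that both extensions live in the same files as their types, since they touch private state:

import UIKit

extension LocalModelRunner {
    // Hypothetical counterpart to prepare().
    func unload() {
        if let ctx = context {
            llama_free(ctx)   // llama.cpp's context destructor
            context = nil
        }
    }
}

extension ModelStore {
    func observeMemoryPressure() {
        Task { [weak self] in
            let warnings = NotificationCenter.default.notifications(
                named: UIApplication.didReceiveMemoryWarningNotification
            )
            for await _ in warnings {
                guard let self else { return }
                await self.runner.unload()   // release the scarce context
                self.state = .cold           // UI stays resumable; next prewarm reloads
            }
        }
    }
}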
The engineering conclusion
Running Gemma 2B on iOS is not hard because one API call is difficult. It is hard because the model runtime crosses product startup, Swift concurrency, Metal compilation, memory pressure, and user perception.
The practical optimization was not "make Metal faster." It was:
- Move model ownership out of SwiftUI views.
- Prewarm after the first stable UI moment.
- Keep runtime parameters stable enough for warm paths.
- Measure cold and warm startup separately on real devices.
- Make the product usable while the model is becoming ready.
On-device AI becomes convincing when the user stops noticing that a model is loading at all.
FAQ
Why is first-load latency much worse than warm-load latency on iOS local models?
The first load often pays Metal pipeline and shader specialization costs, plus context initialization. Warm loads can reuse prepared runtime paths, so perceived startup can drop significantly.
Is moving model initialization to a background thread enough?
No. Background initialization avoids UI lockups, but product experience still depends on when startup cost is paid in the user journey. Prewarm timing and lifecycle ownership are critical.
What is the safest architecture for local model ownership in SwiftUI apps?
A dedicated model lifecycle layer with explicit state transitions is safer than view-owned contexts. It reduces duplicate loads, memory spikes, and inconsistent warm-path behavior.
Which metrics should be tracked to prevent regression?
Track cold-start and warm-start latency separately, log runtime profile parameters, and segment by device and OS version. This makes startup regressions diagnosable in production.