iOS On-Device AI

Building On-Device AI for iOS: My Practical Guide to Frameworks, Code, and What Actually Works

A field guide from Jasonwei on picking the right iOS AI stack, running local generation, and shipping stable product behavior.

2026-05-01 · iOS · On-Device AI · Core ML · LLM Runtime

Developers keep asking me the same question: cloud APIs are useful, but how do we run AI directly on iPhone in a production app? This article is my practical answer from real project work, not theory slides.

On-device AI unlocks lower latency, better privacy, offline capability, and lower serving cost. But it also introduces hard constraints: memory ceilings, battery pressure, model size limits, and runtime tuning complexity.

1. The Three-Layer Mental Model I Use

  • High-level: Apple system frameworks for fast integration and OS-native behavior.
  • Mid-level: Core ML for custom model shipping with explicit input/output control.
  • Low-level: Open-source local runtimes (for token streaming and local LLM chat UX).

In real products, I often combine all three instead of forcing one stack to solve every use case.

2. Foundation Models / Apple Intelligence APIs

When platform support is available, this is the fastest way to add privacy-first AI behaviors without bundling huge model artifacts.

When I choose this path

  • I need quick product velocity with minimal ML infrastructure overhead.
  • I can require modern iOS versions for feature availability.
  • I want system-level performance and privacy defaults.

These APIs evolve quickly, so I treat integration points as replaceable adapters.
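
To keep those integration points replaceable, I hide the system framework behind a small protocol that the rest of the app depends on. A minimal sketch with illustrative names (TextGenerating, SystemTextGenerator), not any specific Apple API:

// The app depends on this abstraction; the system framework stays behind it.
protocol TextGenerating {
  func generate(prompt: String) async throws -> String
}

// Adapter backed by the platform AI framework (availability-gated in practice).
// When the underlying API changes, only this type needs to change.
struct SystemTextGenerator: TextGenerating {
  func generate(prompt: String) async throws -> String {
    // Call the system framework here; the stub keeps the sketch self-contained.
    return "system-generated text for: \(prompt)"
  }
}

A local-runtime or cloud fallback can conform to the same protocol, which is what makes the adapter genuinely swappable.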

3. Core ML: The Most Reliable Production Base

If I need deterministic behavior and predictable app release flow, Core ML is still my default layer.

Typical fit

  • Embedding generation, classification, reranking, and lightweight generation workflows.
  • Models packaged as .mlmodel or .mlpackage.
  • Neural Engine acceleration where supported.

Conceptual embedding service

import CoreML

final class EmbeddingService {
  private let model: MLModel

  init() throws {
    // Compiled Core ML models ship in the bundle as .mlmodelc directories
    guard let url = Bundle.main.url(forResource: "MyEmbeddingModel",
                                    withExtension: "mlmodelc") else {
      throw CocoaError(.fileNoSuchFile)
    }
    self.model = try MLModel(contentsOf: url)
  }

  func embed(text: String) throws -> [Double] {
    // Feature names ("text", "embedding") must match your model's own metadata
    let input = try MLDictionaryFeatureProvider(dictionary: ["text": text])
    let output = try model.prediction(from: input)

    // Convert the MLMultiArray output to [Double] for the vector pipeline
    guard let vector = output.featureValue(for: "embedding")?.multiArrayValue else {
      return []
    }
    return (0..<vector.count).map { vector[$0].doubleValue }
  }
}
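
This is how I typically consume the embeddings downstream; cosineSimilarity and rank are illustrative helpers, not Core ML API:

// Rank stored chunks against a query embedding by cosine similarity.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
  guard a.count == b.count, !a.isEmpty else { return 0 }
  let dot = zip(a, b).map(*).reduce(0, +)
  let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
  let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
  return (normA > 0 && normB > 0) ? dot / (normA * normB) : 0
}

func rank(chunks: [(text: String, vector: [Double])],
          query: String,
          using service: EmbeddingService) throws -> [String] {
  let queryVector = try service.embed(text: query)
  return chunks
    .sorted { cosineSimilarity($0.vector, queryVector) > cosineSimilarity($1.vector, queryVector) }
    .map { $0.text }
}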

4. Running Local LLM Generation on iOS

For ChatGPT-like interaction, I use a lightweight runtime with quantized models and stream tokens into SwiftUI. This is where product engineering matters most.

My default sequence

  1. Start with small quantized models (1B-3B class) to guarantee first success.
  2. Integrate a mobile-friendly runtime (for example a llama.cpp-based wrapper).
  3. Stream token output into UI to hide generation latency.
  4. Measure cold-start, warm-start, and memory peaks before scaling model size.

Rule: UX quality and guardrails usually beat raw model size in mobile apps.
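
Behind step 2, I put the runtime behind a thin client abstraction (the LLMClient.swift and LLMConfig.swift files in the layout below). A minimal sketch, assuming a llama.cpp-style wrapper underneath; the field names are illustrative, not any runtime's real API:

import Foundation

// Illustrative generation settings; tune per device class.
struct LLMConfig {
  var modelURL: URL              // quantized model file bundled with or downloaded by the app
  var contextLength: Int = 2048
  var maxTokens: Int = 256
  var temperature: Double = 0.7
}

// The app talks to this protocol; the concrete wrapper streams tokens through onToken.
protocol LLMClientProtocol {
  func generate(prompt: String, onToken: @escaping (String) -> Void) async throws
}

This is the same LLMClientProtocol the view model in section 6 consumes.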

5. A Practical Project Layout

OnDeviceLLMDemo/
  App/
    OnDeviceLLMDemoApp.swift
  Features/
    Chat/
      ChatView.swift
      ChatViewModel.swift
  AI/
    LLM/
      LLMClient.swift
      LLMConfig.swift
      TokenStreamer.swift
    Embeddings/
      EmbeddingService.swift
  Utils/
    PerformanceMonitor.swift

This split keeps model lifecycle, UI, and instrumentation independent, which makes debugging and tuning far easier during release cycles.
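
A sketch of what PerformanceMonitor.swift covers for me; the names are illustrative, and the memory read uses the standard task_info / phys_footprint approach:

import Foundation

enum PerformanceMonitor {
  // Times one model load or generation pass and logs it with a label
  // (e.g. "cold-start" on first launch, "warm-start" once the model is cached).
  static func measure<T>(_ label: String, _ work: () throws -> T) rethrows -> T {
    let start = DispatchTime.now()
    defer {
      let ms = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000
      print("[perf] \(label): \(String(format: "%.1f", ms)) ms")
    }
    return try work()
  }

  // Approximate physical memory footprint, useful for tracking peaks per device model.
  static func memoryFootprintMB() -> Double {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<natural_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) {
      $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
        task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
      }
    }
    return kr == KERN_SUCCESS ? Double(info.phys_footprint) / 1_048_576 : 0
  }
}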

6. Minimal SwiftUI Streaming Pattern

@MainActor
final class ChatViewModel: ObservableObject {
  @Published var messages: [String] = []
  @Published var isGenerating = false

  private let llm: LLMClientProtocol
  private var task: Task<Void, Never>?

  init(llm: LLMClientProtocol) {
    self.llm = llm
  }

  func send(prompt: String) {
    isGenerating = true
    task?.cancel()
    // Append the user turn plus an empty slot that the streamed reply fills in
    messages.append(prompt)
    messages.append("")
    let replyIndex = messages.count - 1
    task = Task {
      do {
        var response = ""
        try await llm.generate(prompt: prompt) { token in
          response += token
          self.messages[replyIndex] = response
        }
      } catch {
        self.messages[replyIndex] = "Error: \(error.localizedDescription)"
      }
      self.isGenerating = false
    }
  }
}
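
Wiring it into SwiftUI is then mostly bookkeeping. An illustrative ChatView (matching ChatView.swift in the layout above); the list re-renders on every published token update:

import SwiftUI

struct ChatView: View {
  @StateObject private var viewModel: ChatViewModel
  @State private var draft = ""

  init(llm: LLMClientProtocol) {
    _viewModel = StateObject(wrappedValue: ChatViewModel(llm: llm))
  }

  var body: some View {
    VStack {
      // Messages re-render as the view model appends streamed tokens.
      List(viewModel.messages.indices, id: \.self) { index in
        Text(viewModel.messages[index])
      }
      HStack {
        TextField("Ask something…", text: $draft)
        Button("Send") {
          viewModel.send(prompt: draft)
          draft = ""
        }
        .disabled(viewModel.isGenerating || draft.isEmpty)
      }
      .padding()
    }
  }
}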

7. How I Decide Between Frameworks

Need                                  Best first choice          Why
Fast AI feature in modern iOS         Apple system frameworks    Lowest integration cost, strong system defaults
Custom deterministic model workflow   Core ML                    Predictable model packaging and deployment
Token-by-token local chat             Open-source LLM runtime    Best control over generation and streaming UX

8. Jasonwei's Shipping Checklist

  • Benchmark cold-start and warm-start separately.
  • Log memory peaks and generation throughput by device model.
  • Treat model load failures as a first-class user flow (see the sketch after this list).
  • Prefer smaller models with stronger prompt constraints first.
  • Only scale model size after stable UX and crash-free sessions.
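
For the load-failure item, I model the state explicitly instead of letting the app limp along. A sketch with illustrative names, reusing the LLMConfig and LLMClientProtocol sketched earlier:

// Load state surfaced to the UI, with failure as a real, designed-for case.
enum ModelLoadState {
  case loading
  case ready(LLMClientProtocol)
  case failed(message: String)   // show retry / fallback UI, never a silent crash path
}

// Hypothetical loader: try the preferred model, fall back to a smaller one,
// and report failure as state the UI can present.
func loadModel(preferred: LLMConfig, fallback: LLMConfig,
               make: (LLMConfig) throws -> LLMClientProtocol) -> ModelLoadState {
  for config in [preferred, fallback] {
    if let client = try? make(config) {
      return .ready(client)
    }
  }
  return .failed(message: "No local model could be loaded on this device.")
}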
