Building On-Device AI for iOS: My Practical Guide to Frameworks, Code, and What Actually Works
A field guide from Jasonwei on picking the right iOS AI stack, running local generation, and shipping stable product behavior.
Developers keep asking me the same question: cloud APIs are useful, but how do we run AI directly on iPhone in a production app? This article is my practical answer, drawn from real project work rather than theory slides.
On-device AI unlocks lower latency, better privacy, offline capability, and lower serving cost. But it also introduces hard constraints: memory ceilings, battery pressure, model size limits, and runtime tuning complexity.
1. The Three-Layer Mental Model I Use
- High-level: Apple system frameworks for fast integration and OS-native behavior.
- Mid-level: Core ML for custom model shipping with explicit input/output control.
- Low-level: Open-source local runtimes (for token streaming and local LLM chat UX).
In real products, I often combine all three instead of forcing one stack to solve every use case.
2. Foundation Models / Apple Intelligence APIs
When platform support is available, this is the fastest way to add privacy-first AI behaviors without bundling huge model artifacts.
When I choose this path
- I need quick product velocity with minimal ML infrastructure overhead.
- I can require modern iOS versions for feature availability.
- I want system-level performance and privacy defaults.
These APIs evolve quickly, so I treat integration points as replaceable adapters.
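As a concrete example of the adapter idea, I keep a small protocol between feature code and the system framework. Everything below (TextGenerating, SystemModelAdapter) is a hypothetical sketch of my own wrapper, not an Apple API; the conforming type is where the call into the system model session would live, behind an availability check.

import Foundation

// Hypothetical app-level abstraction; feature code only ever sees this protocol.
protocol TextGenerating {
    func complete(_ prompt: String) async throws -> String
}

// Sketch of the adapter that would wrap the system framework.
// If the OS or device cannot provide the feature, the initializer fails and the app
// falls back to a Core ML or local-runtime implementation instead.
struct SystemModelAdapter: TextGenerating {
    init?() {
        // Gate on the minimum OS version the feature requires for your targets.
        guard #available(iOS 18, *) else { return nil }
    }

    func complete(_ prompt: String) async throws -> String {
        // Call into the system text-generation session here; kept as a placeholder
        // so this sketch does not pin a specific API shape.
        throw NSError(domain: "SystemModelAdapter", code: -1)
    }
}

When the adapter is nil at launch, I construct a different TextGenerating implementation instead, so feature code never branches on framework availability.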
3. Core ML: The Most Reliable Production Base
If I need deterministic behavior and predictable app release flow, Core ML is still my default layer.
Typical fit
- Embedding generation, classification, reranking, and lightweight generation workflows.
- Models packaged as `.mlmodel` or `.mlpackage`.
- Neural Engine acceleration where supported.
Conceptual embedding service
import CoreML

final class EmbeddingService {
    private let model: MLModel

    init() throws {
        // Xcode compiles the bundled .mlmodel/.mlpackage into .mlmodelc at build time.
        guard let url = Bundle.main.url(forResource: "MyEmbeddingModel", withExtension: "mlmodelc") else {
            throw CocoaError(.fileNoSuchFile)
        }
        self.model = try MLModel(contentsOf: url)
    }

    func embed(text: String) throws -> [Double] {
        let input = try MLDictionaryFeatureProvider(dictionary: ["text": text])
        let out = try model.prediction(from: input)
        // Convert the MLMultiArray output to [Double] for your vector pipeline.
        // The output feature name ("embedding" here) depends on your model's interface.
        guard let array = out.featureValue(for: "embedding")?.multiArrayValue else {
            return []
        }
        return (0..<array.count).map { array[$0].doubleValue }
    }
}
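A typical consumer of this service is an on-device reranker: embed the query and each candidate, then score by cosine similarity. The helper below is a generic sketch, independent of any particular model.

// Cosine similarity between two embedding vectors; returns 0 for degenerate input.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot = 0.0
    for (x, y) in zip(a, b) { dot += x * y }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA * normB)
}

For large candidate sets I move this to Accelerate's vDSP routines, but plain Swift is fine for a few hundred vectors.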
4. Running Local LLM Generation on iOS
For ChatGPT-like interaction, I use a lightweight runtime with quantized models and stream tokens into SwiftUI. This is where product engineering matters most.
My default sequence
- Start with small quantized models (1B-3B class) to guarantee first success.
- Integrate a mobile-friendly runtime (for example a `llama.cpp`-based wrapper).
- Stream token output into the UI to hide generation latency.
- Measure cold-start, warm-start, and memory peaks before scaling model size.
Rule: UX quality and guardrails usually beat raw model size in mobile apps.
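Regardless of which runtime I bundle, I hide it behind a small interface so the rest of the app never imports the wrapper directly. The names below (LLMClientProtocol, LLMConfig) are my own illustrative sketch, matching how the view model in section 6 consumes it, not part of any specific library.

import Foundation

// Hypothetical abstraction over whichever local runtime the app bundles
// (for example, a llama.cpp-based wrapper would conform to this).
protocol LLMClientProtocol {
    // Streams decoded tokens to `onToken` until generation finishes or the task is cancelled.
    func generate(prompt: String, onToken: @escaping (String) -> Void) async throws
}

// Sketch of the knobs the wrapper might expose; tune per device class.
struct LLMConfig {
    var modelURL: URL
    var contextLength: Int = 2048
    var maxTokens: Int = 512
    var temperature: Double = 0.7
}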
5. A Practical Project Layout
OnDeviceLLMDemo/
  App/
    OnDeviceLLMDemoApp.swift
  Features/
    Chat/
      ChatView.swift
      ChatViewModel.swift
  AI/
    LLM/
      LLMClient.swift
      LLMConfig.swift
      TokenStreamer.swift
    Embeddings/
      EmbeddingService.swift
  Utils/
    PerformanceMonitor.swift
This split keeps model lifecycle, UI, and instrumentation independent, which makes debugging and tuning far easier during release cycles.
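PerformanceMonitor is where the instrumentation behind section 8 lives. A minimal sketch (names are mine, not a library API) that times a phase such as model load or first token and logs it:

import Foundation
import os

enum PerformanceMonitor {
    private static let logger = Logger(subsystem: "OnDeviceLLMDemo", category: "AI")

    // Times a synchronous phase (model load, first token, full generation) and logs it.
    static func measure<T>(_ label: String, _ work: () throws -> T) rethrows -> T {
        let start = CFAbsoluteTimeGetCurrent()
        defer {
            let elapsed = CFAbsoluteTimeGetCurrent() - start
            logger.info("\(label, privacy: .public) took \(elapsed) s")
        }
        return try work()
    }
}

Wrapping the Core ML load as PerformanceMonitor.measure("model-load") { try EmbeddingService() } is usually enough to separate cold-start from warm-start in the logs.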
6. Minimal SwiftUI Streaming Pattern
import SwiftUI

@MainActor
final class ChatViewModel: ObservableObject {
    @Published var messages: [String] = []
    @Published var isGenerating = false

    private let llm: LLMClientProtocol
    private var task: Task<Void, Never>?

    init(llm: LLMClientProtocol) {
        self.llm = llm
    }

    func send(prompt: String) {
        isGenerating = true
        task?.cancel()
        // Reserve a slot for this response so streamed tokens update it in place.
        messages.append("")
        let index = messages.count - 1
        task = Task {
            do {
                var response = ""
                try await llm.generate(prompt: prompt) { token in
                    response += token
                    self.messages[index] = response
                }
            } catch {
                self.messages[index] = "Error: \(error.localizedDescription)"
            }
            self.isGenerating = false
        }
    }
}
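The matching view stays simple: it renders messages and disables input while isGenerating is true. A minimal sketch (layout details are illustrative):

import SwiftUI

struct ChatView: View {
    @StateObject var viewModel: ChatViewModel
    @State private var draft = ""

    var body: some View {
        VStack {
            // Each streamed response re-renders as its message string grows.
            ScrollView {
                ForEach(viewModel.messages.indices, id: \.self) { index in
                    Text(viewModel.messages[index])
                        .frame(maxWidth: .infinity, alignment: .leading)
                        .padding(.vertical, 4)
                }
            }
            HStack {
                TextField("Ask something", text: $draft)
                Button("Send") {
                    viewModel.send(prompt: draft)
                    draft = ""
                }
                .disabled(viewModel.isGenerating || draft.isEmpty)
            }
        }
        .padding()
    }
}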
7. How I Decide Between Frameworks
| Need | Best first choice | Why |
|---|---|---|
| Fast AI feature in modern iOS | Apple system frameworks | Lowest integration cost, strong system defaults |
| Custom deterministic model workflow | Core ML | Predictable model packaging and deployment |
| Token-by-token local chat | Open-source LLM runtime | Best control over generation and streaming UX |
8. Jasonwei's Shipping Checklist
- Benchmark cold-start and warm-start separately.
- Log memory peaks and generation throughput by device model.
- Treat model load failures as a first-class user flow.
- Prefer smaller models with stronger prompt constraints first.
- Only scale model size after stable UX and crash-free sessions.
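On the third point, I model load failure explicitly in app state instead of letting the feature silently disappear. A hypothetical sketch of the state I track:

// Hypothetical: the app records which AI path is actually live on this device.
enum AIBackendState {
    case loading
    case ready(any LLMClientProtocol)
    case unavailable(reason: String)   // drives an explicit fallback UI, not a blank screen
}

The unavailable case gets the same design attention as the happy path: copy, a retry affordance, and analytics.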