Open Source / Runtime Stack / On-device AI

The OSS stack I actually use for local-model product delivery.

These are not random links. Each repository is mapped to where it fits in shipping iOS AI products, what can go wrong, and how to combine it with product constraints.

Core Runtime

llama.cpp

The default local runtime for GGUF inference when latency predictability and deployment control matter more than framework convenience.

My view: Best baseline for on-device text inference when you need tight control over memory and token throughput.

Use when: You need local chat, extraction, or rewrite flows with deterministic prompt contracts.

Watch-outs: Quantization level and context window size both change memory footprint, latency, and output quality, so an untested change to either can easily break UX consistency.
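One way to make the context window trade-off concrete is to estimate KV-cache memory before choosing a setting. The sketch below uses the usual estimate (keys plus values, per layer, per token). `ModelShape`, `kvCacheBytes`, and `largestContext` are hypothetical names, the shape numbers in the usage note are illustrative rather than taken from a specific model, and the estimate ignores model weights and runtime overhead.

```swift
/// Hypothetical description of a model's attention geometry.
/// In practice these values would come from the model's metadata.
struct ModelShape {
    let layers: Int
    let kvHeads: Int
    let headDim: Int
}

/// Estimated KV-cache size in bytes:
/// 2 (keys and values) * layers * context tokens * KV heads * head dim * bytes per element.
/// bytesPerElement defaults to 2, which assumes a 16-bit cache.
func kvCacheBytes(shape: ModelShape, contextTokens: Int, bytesPerElement: Int = 2) -> Int {
    return 2 * shape.layers * contextTokens * shape.kvHeads * shape.headDim * bytesPerElement
}

/// Returns the largest candidate context size whose estimated cache fits the budget,
/// or nil when none of the candidates fit.
func largestContext(fitting budgetBytes: Int, shape: ModelShape, candidates: [Int]) -> Int? {
    return candidates
        .sorted(by: >)
        .first(where: { kvCacheBytes(shape: shape, contextTokens: $0) <= budgetBytes })
}
```

With an illustrative shape of 32 layers, 8 KV heads, and a head dimension of 128, each token costs 128 KiB of cache, so a 4096-token window needs 512 MiB and a 300 MiB budget lands on 2048 tokens. Fixing that number per device tier is one way to keep behavior consistent across the fleet.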


Swift Integration

LlamaSwift

A Swift wrapper around llama.cpp that speeds up iOS integration and reduces glue-code complexity for app teams shipping local models.

My view: Good for fast Swift MVP cycles when you need a production-oriented native interface.

Use when: You want local model features in iOS without maintaining a large C++ bridge surface yourself.

Watch-outs: Pin versions carefully so wrapper updates do not silently alter runtime behavior.
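With Swift Package Manager, an exact-version requirement is one way to pin. The manifest below is only a sketch: the repository URL, the version number, the product name, and the `MyApp` target are all placeholders, not the wrapper's real coordinates.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp", // placeholder app/package name
    platforms: [.iOS(.v16)],
    dependencies: [
        // `exact:` refuses any other version, so an update is always a deliberate edit.
        // URL and version are placeholders; substitute the real repository and release.
        .package(url: "https://example.com/placeholder/LlamaSwift.git", exact: "0.0.0"),
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Product name is an assumption; check the wrapper's own manifest.
                .product(name: "LlamaSwift", package: "LlamaSwift"),
            ]
        ),
    ]
)
```

Committing `Package.resolved` alongside the manifest also records the resolved versions of transitive dependencies.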


Speech Layer

WhisperKit

For on-device speech recognition on Apple platforms, WhisperKit is one of the fastest ways to add practical voice input loops in local-first products.

My view: Great complement to GGUF chat products where voice-to-intent is a key entry point.

Use when: You need real-time transcription on device for privacy-sensitive mobile flows.

Watch-outs: Background performance and thermal behavior must be tested on lower-end iPhones.
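A minimal sketch of how thermal state could feed into transcription settings, assuming the app maps the system's `ProcessInfo.ThermalState` to its own policy. `ThermalLevel`, `TranscriptionPolicy`, and `policy(for:)` are hypothetical names and the numbers are illustrative; none of this is WhisperKit API.

```swift
import Foundation

/// Platform-independent thermal level, so the policy can be unit tested off-device.
enum ThermalLevel: Int, Comparable {
    case nominal, fair, serious, critical

    static func < (lhs: ThermalLevel, rhs: ThermalLevel) -> Bool {
        return lhs.rawValue < rhs.rawValue
    }
}

/// Hypothetical app-level settings for a voice input loop.
struct TranscriptionPolicy: Equatable {
    let allowTranscription: Bool
    let streamPartialResults: Bool
    let chunkSeconds: Double
}

/// Degrades the voice loop step by step as the device heats up.
func policy(for level: ThermalLevel) -> TranscriptionPolicy {
    switch level {
    case .nominal, .fair:
        return TranscriptionPolicy(allowTranscription: true, streamPartialResults: true, chunkSeconds: 1.0)
    case .serious:
        // Fewer, larger chunks and no partial results to reduce sustained load.
        return TranscriptionPolicy(allowTranscription: true, streamPartialResults: false, chunkSeconds: 5.0)
    case .critical:
        // Pause transcription entirely until the device cools down.
        return TranscriptionPolicy(allowTranscription: false, streamPartialResults: false, chunkSeconds: 5.0)
    }
}

#if canImport(Darwin)
extension ThermalLevel {
    /// Maps the system thermal state to the app-level enum (Apple platforms only).
    init(_ state: ProcessInfo.ThermalState) {
        switch state {
        case .nominal: self = .nominal
        case .fair: self = .fair
        case .serious: self = .serious
        case .critical: self = .critical
        @unknown default: self = .critical // treat unknown future states conservatively
        }
    }
}
#endif
```

On device, the current level would come from `ThermalLevel(ProcessInfo.processInfo.thermalState)`, re-evaluated when `ProcessInfo.thermalStateDidChangeNotification` fires. Logging which policy was active during test runs on older hardware makes thermal regressions easier to spot.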


Apple Native AI

Apple Foundation Models

For supported devices, Apple Foundation Models are useful as a native completion layer beside GGUF runtime paths, especially for rewrite and concise summarization.

My view: Not a replacement for your full runtime stack, but strong as a native capability enhancer.

Use when: You want lower integration friction and Apple-native behavior in app-level AI assistance.

Watch-outs: Feature gating by OS/hardware must be explicit so unsupported devices keep clean fallbacks.
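Explicit gating could look roughly like the sketch below. It assumes the `FoundationModels` framework's `SystemLanguageModel.default.availability` check and the OS 26 availability baseline. `CompletionBackend`, `appleModelIsAvailable`, and `chooseBackend` are hypothetical names for illustration.

```swift
#if canImport(FoundationModels)
import FoundationModels
#endif

/// Hypothetical routing target for a completion request.
enum CompletionBackend: Equatable {
    case appleFoundationModels
    case ggufRuntime
}

/// Returns true only when the SDK has the framework, the OS is new enough,
/// and the system model reports itself as available on this device.
func appleModelIsAvailable() -> Bool {
    #if canImport(FoundationModels)
    if #available(iOS 26.0, macOS 26.0, visionOS 26.0, *) {
        if case .available = SystemLanguageModel.default.availability {
            return true
        }
    }
    #endif
    return false
}

/// Pure routing decision, kept separate from the platform check so it is easy to test.
/// Anything that cannot or should not use the native model falls back to the GGUF path.
func chooseBackend(appleModelAvailable: Bool, taskPrefersNative: Bool) -> CompletionBackend {
    return (appleModelAvailable && taskPrefersNative) ? .appleFoundationModels : .ggufRuntime
}
```

A call site would then read `chooseBackend(appleModelAvailable: appleModelIsAvailable(), taskPrefersNative: true)` for a rewrite or summarization task. Keeping the fallback as the default branch means an unsupported device never reaches a code path that assumes the native model exists.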
