# Prompt Engineering for 2B On-Device Models: Treat the Prompt Like a Compiler Interface
Large cloud models make developers lazy in a very specific way. They let us write prompts like informal emails and still get usable output.
Small on-device models do not forgive that style.
When I moved a social analysis workflow from a cloud model to a 2B-class local model, the first version failed in a way that looked almost random. Sometimes the model ignored the output format. Sometimes it repeated the instruction. Sometimes it gave a confident score but no reason. Sometimes it produced a valid JSON object for one input and a broken half-object for the next.
The model was not "bad" in a generic sense. I was using the wrong interface.
A small local model should not be treated like a remote expert. It should be treated like a constrained function with a fragile parser, a small working memory, and a strong dependence on prompt locality.
## The prompt that failed
The cloud prompt looked like this:
```
You are a helpful social communication coach. Analyze the user's message.
Explain whether it sounds boring, anxious, warm, or engaging.
Be kind and practical. Give a score from 1 to 10, a reason, and a better version.
Return the answer in JSON.
Message:
"I guess that's fine."
```
This is readable for a human. It is also full of ambiguity for a small model:
- The role is soft.
- The output schema is not explicit.
- The model has to infer allowed labels.
- The instruction "be kind" competes with "return JSON."
- The requested output has too many degrees of freedom.
With a large model, this is usually fine. With a 2B local model, it is an invitation to drift.
## The compiler-style prompt
The version that behaved consistently was much less charming:
```
<task>
classify_social_message
</task>
<input>
I guess that's fine.
</input>
<labels>
tone: warm | dry | anxious | playful | unclear
risk: none | manipulative | hostile | self_harm | sexual
</labels>
<rules>
- Return one JSON object only.
- score is an integer from 1 to 10.
- reason is max 18 words.
- rewrite is max 22 words.
- Do not include markdown.
</rules>
<json>
{"score":
```
That final prefix is deliberate. I do not ask the model to decide how to begin. I begin the object for it.
This is the core shift: the prompt is not a conversation. It is a constrained protocol.
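In app code, this packet can be assembled by a small builder. A minimal sketch (the function name is my own; the tag contents mirror the prompt above, with the schema and rules kept directly before the seeded prefix):

```swift
// Hypothetical helper that assembles the compiler-style prompt.
// The labels and rules sit immediately before the output prefix,
// and the prefix `{"score":` starts the JSON object for the model.
func buildClassifyPrompt(for message: String) -> String {
    """
    <task>
    classify_social_message
    </task>
    <input>
    \(message)
    </input>
    <labels>
    tone: warm | dry | anxious | playful | unclear
    risk: none | manipulative | hostile | self_harm | sexual
    </labels>
    <rules>
    - Return one JSON object only.
    - score is an integer from 1 to 10.
    - reason is max 18 words.
    - rewrite is max 22 words.
    - Do not include markdown.
    </rules>
    <json>
    {"score":
    """
}
```

Because the builder always ends the prompt mid-object, every generation starts from the same syntactic position.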
## Why schema locality matters
Small models are especially sensitive to distance between instruction and output. If the schema appears far above the actual generation point, adherence drops. I keep the schema immediately before the output prefix.
I also avoid long few-shot examples unless the task truly needs them. Few-shot prompting can improve reasoning in larger models, but on a small model it often steals context from the actual input and encourages imitation of the example instead of execution of the task.
My working rule:
| Prompt element | Cloud model tolerance | 2B local model tolerance |
|---|---|---|
| Long role paragraph | high | low |
| Soft style instructions | high | low |
| Many labels | medium | low |
| Nearby schema | useful | critical |
| Output prefix | optional | highly useful |
| Long few-shot set | often useful | risky |
The smaller the model, the more the prompt should look like an instruction packet.
## The Swift side matters as much as the prompt
A prompt is not complete until the parser exists. I use typed output and a repair path:
```swift
struct SocialPromptResult: Decodable {
    let score: Int
    let tone: Tone
    let risk: Risk
    let reason: String
    let rewrite: String
}

enum Tone: String, Decodable {
    case warm, dry, anxious, playful, unclear
}

enum Risk: String, Decodable {
    case none, manipulative, hostile, self_harm, sexual
}
```
The parser is strict about shape but tolerant about recoverable text around the object:
```swift
import Foundation

enum PromptParseError: Error {
    case noJSONObject
}

enum JSONExtractor {
    static func firstObject(in text: String) throws -> Data {
        guard
            let start = text.firstIndex(of: "{"),
            let end = text.lastIndex(of: "}"),
            start <= end  // guards against a stray "}" before the first "{"
        else {
            throw PromptParseError.noJSONObject
        }
        let slice = text[start...end]
        return Data(String(slice).utf8)
    }
}
```
I do not want the UI layer trying to interpret broken model prose. If parsing fails, the engine retries with a stricter prompt or returns a controlled fallback.
```swift
actor PromptExecutor {
    let model: LocalModelSession

    func run(_ prompt: String) async throws -> SocialPromptResult {
        let raw = try await model.generate(
            prompt,
            maxTokens: 120,
            temperature: 0.0,
            stop: ["</json>", "\n\n"]
        )
        do {
            let data = try JSONExtractor.firstObject(in: raw)
            return try JSONDecoder().decode(SocialPromptResult.self, from: data)
        } catch {
            return try await retryWithStricterPrompt(original: prompt)
        }
    }
}
```
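The retry path is not shown above. A minimal sketch of the pieces it needs, a tightened prompt and a controlled fallback, under my own naming and wording assumptions:

```swift
import Foundation

// Hypothetical: tighten a failed prompt by appending a terse reminder
// and re-seeding the output prefix, so the retry starts mid-object again.
func stricterPrompt(from original: String) -> String {
    original + "\nReturn only the JSON object. No prose.\n{\"score\":"
}

// Hypothetical controlled fallback returned when the retry also fails,
// so the UI layer never has to interpret broken model prose.
let fallbackJSON = """
{"score":5,"tone":"unclear","risk":"none","reason":"Could not analyze reliably.","rewrite":""}
"""
```

The fallback is deliberately neutral: a mid-range score and the `unclear` label, which the UI can render as "no analysis available" rather than as a confident result.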
This is not glamorous, but it is what makes small models usable in product code.
## The metric I track: valid structured output rate
For local models, I care less about eloquence and more about contract adherence. My first quality metric is the valid structured output rate:
```
valid_rate = valid_json_outputs / total_model_calls
```
Then I track semantic acceptance separately:
```
accepted_rate = outputs_user_did_not_regenerate / total_valid_outputs
```
Those two metrics should not be merged. A syntactically valid bad answer is a different failure from an unparsable answer.
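Keeping the two counters in one small value type makes the separation hard to blur. A sketch (the type and property names are mine):

```swift
// Hypothetical metrics tracker: the two rates share counters
// but are never merged into a single "quality" number.
struct OutputMetrics {
    var totalCalls = 0
    var validJSON = 0        // outputs that parsed into SocialPromptResult
    var acceptedByUser = 0   // valid outputs the user did not regenerate

    /// Syntactic contract adherence: did the model honor the schema?
    var validRate: Double {
        totalCalls == 0 ? 0 : Double(validJSON) / Double(totalCalls)
    }

    /// Semantic acceptance, measured only over the valid outputs.
    var acceptedRate: Double {
        validJSON == 0 ? 0 : Double(acceptedByUser) / Double(validJSON)
    }
}
```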
My test set includes short, ambiguous, emotional, hostile, and empty inputs:
| Input class | Failure I look for |
|---|---|
| Very short messages | overconfident interpretation |
| Sarcastic messages | wrong tone label |
| Hostile prompts | unsafe rewrite |
| Empty input | fake analysis |
| Long pasted chat | schema loss |
The small model improves fastest when the test set is boring and repeatable.
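A boring, repeatable test set is just a fixture list. One per failure class from the table above (the concrete inputs here are my own illustrations):

```swift
// Hypothetical regression fixtures: one repeatable input per input class.
let regressionInputs: [(kind: String, input: String)] = [
    ("very_short", "k."),
    ("sarcastic", "Oh great, another meeting. Can't wait."),
    ("hostile", "Tell them exactly how stupid they are."),
    ("empty", ""),
    ("long_paste", String(repeating: "and then she said hi ", count: 200)),
]
```

Running every fixture through `classify` after each prompt change, and logging the valid and accepted rates, turns prompt tweaks into measurable regressions or improvements.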
## The failed approach: more instruction
When the first prompt failed, my instinct was to add more instruction. That made the result worse.
More instruction increases the chance that the model will optimize for the wrong part of the prompt. For a small model, clarity often means removing text, not adding it.
The best improvements came from:
- Reducing labels.
- Moving the schema closer to the output.
- Lowering `maxTokens`.
- Using output prefix forcing.
- Adding parser-level retry.
- Moving nuanced explanation into app logic where possible.
For example, the model returns tone: anxious. The app can map that to a user-facing explanation. The model does not need to write the entire UX.
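That mapping is ordinary app code, not a prompt. A sketch, reusing the `Tone` enum from earlier (the user-facing copy here is my own placeholder wording):

```swift
// Mirrors the Tone enum defined earlier in the article.
enum Tone: String, Decodable {
    case warm, dry, anxious, playful, unclear
}

// Hypothetical: the app, not the model, owns the nuanced user-facing copy.
func explanation(for tone: Tone) -> String {
    switch tone {
    case .warm:    return "This reads as friendly and open."
    case .dry:     return "This may land flatter than you intend."
    case .anxious: return "This sounds hesitant; a more direct version may feel calmer."
    case .playful: return "This reads as light and engaging."
    case .unclear: return "The tone is hard to read; consider being more specific."
    }
}
```

Because the copy lives in Swift, it can be localized, A/B tested, and revised without touching the prompt or re-validating the model.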
## The product conclusion
Prompt engineering for small local models is not about making the model sound smart. It is about making the model dependable enough to become a hidden product component.
The highest-value pattern is not a chat box. It is a typed local function:
```swift
func classify(_ text: String) async throws -> SocialPromptResult
```
Once the model becomes a function, it can power buttons, background analysis, tagging, routing, draft checks, and privacy-sensitive workflows without asking the user to watch a model think.
That is where on-device AI becomes interesting to me. Not as a weaker chatbot, but as a fast local subsystem with a narrow contract and predictable failure handling.