GitHub Breakdowns

AI Wrapper Architecture: Separating Control Flow, Streaming Flow, and Product State

2026-04-05 · Architecture · BFF · AI Wrapper · System Design · Streaming

Most failed AI wrappers do not fail because the underlying model is weak. They fail because the application treats model generation as a normal request-response feature.

That is the wrong abstraction.

An LLM product has at least three flows that should not be collapsed into one endpoint:

  1. Control flow: authentication, permissions, credits, session creation, feature flags.
  2. Streaming flow: token delivery, cancellation, timeout handling, retry boundaries.
  3. Product state: conversation history, citations, generated artifacts, billing events, analytics.

When developers put all three flows into a single serverless route, the product becomes fragile. A slow database query can interrupt a token stream. A billing write can delay first token time. A vector search timeout can leave the UI in a half-generated state. The user does not know which subsystem failed. They only see a response that stops mid-sentence.

This is why I prefer a BFF-oriented architecture for serious AI products. The BFF is not just a "backend for the frontend" in the old mobile sense. It is the boundary that keeps the user-facing stream clean while slower product state work happens outside the critical path.

Figure 1: A production AI wrapper should keep the streaming path thin and move slow product state updates behind durable asynchronous boundaries. (Diagram: client → BFF for auth and policy → stream gateway over SSE/WebSocket → LLM, with side effects queued to state workers that save, meter, and index.)

The anti-pattern: one route that does everything

The common implementation looks like this:

export async function POST(req: Request) {
  const user = await authenticate(req)
  await checkCredits(user.id)
  const conversation = await db.conversation.findUnique(...)
  const context = await vectorDB.search(...)
  const stream = await openai.chat.completions.create({ stream: true, ... })
  await db.message.create(...)
  await decrementCredits(user.id)
  return streamToResponse(stream)
}

This code is attractive because it is easy to understand locally. It is also where production problems begin.

The endpoint has mixed responsibilities:

  1. It authenticates.
  2. It reads product state.
  3. It performs retrieval.
  4. It starts a long-running stream.
  5. It writes billing state.
  6. It writes generated content.
  7. It owns the client-visible failure.

Every await before the first token increases time to first token. Every await during or after the stream adds a place where the user-visible request can fail for reasons unrelated to generation quality.

The architecture I prefer

I separate the system into a command path and a stream path.

The command path creates an intent:

type GenerationCommand = {
  userId: string
  conversationId: string
  messageId: string
  modelPolicy: "fast" | "deep" | "local-first"
  promptRef: string
  createdAt: string
}

The BFF validates the request and returns a stream token:

export async function POST(req: Request) {
  const user = await requireUser(req)
  const body = await req.json()

  const command = await createGenerationCommand({
    userId: user.id,
    conversationId: body.conversationId,
    input: body.input
  })

  return Response.json({
    streamUrl: `/api/stream/${command.messageId}`,
    messageId: command.messageId
  })
}
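
createGenerationCommand is left abstract above. A minimal sketch of what it could do, assuming a relational db client, a promptStore for raw input, and crypto.randomUUID for identifiers (all assumptions, not part of the original route):

type CreateCommandArgs = {
  userId: string
  conversationId: string
  input: string
}

// Hypothetical helper: persists the raw input, records the command durably,
// and returns the identifiers the client needs to open the stream.
async function createGenerationCommand(args: CreateCommandArgs): Promise<GenerationCommand> {
  const messageId = crypto.randomUUID()
  const promptRef = await promptStore.save(args.input) // assumed store for raw prompts

  const command: GenerationCommand = {
    userId: args.userId,
    conversationId: args.conversationId,
    messageId,
    modelPolicy: "fast",
    promptRef,
    createdAt: new Date().toISOString()
  }

  await db.generationCommand.create({ data: command }) // durable record the stream endpoint loads later
  return command
}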

The stream endpoint is intentionally thin:

export const runtime = "edge"

export async function GET(
  _req: Request,
  { params }: { params: { messageId: string } }
) {
  const command = await loadStreamCommand(params.messageId)

  const stream = await modelGateway.stream({
    promptRef: command.promptRef,
    policy: command.modelPolicy
  })

  return toServerSentEvents(stream, {
    onFinal: finalText => enqueueFinalization(command.messageId, finalText),
    onError: error => enqueueFailure(command.messageId, error)
  })
}
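
toServerSentEvents is what keeps this route thin. A minimal sketch of the helper, assuming the model gateway yields text chunks as an AsyncIterable<string>; the hook names match the route above, everything else is an assumption:

type StreamHooks = {
  onFinal: (finalText: string) => void
  onError: (error: unknown) => void
}

// Hypothetical helper: turns a token stream into a Server-Sent Events response.
// Side effects run through the hooks and are never awaited on the token path.
function toServerSentEvents(tokens: AsyncIterable<string>, hooks: StreamHooks): Response {
  const encoder = new TextEncoder()

  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      let finalText = ""
      try {
        for await (const token of tokens) {
          finalText += token
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`))
        }
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ done: true })}\n\n`))
        hooks.onFinal(finalText)
      } catch (error) {
        hooks.onError(error)
      } finally {
        controller.close()
      }
    }
  })

  return new Response(body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive"
    }
  })
}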

The database write for final text is not in the critical path of token delivery. Billing reconciliation, embeddings, audit logs, and analytics become worker jobs.
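
A sketch of that finalization side, assuming a generic job queue and Prisma-style database calls; the queue, the table fields, and the metering and indexing helpers are assumptions:

type FinalizationJob = {
  messageId: string
  finalText: string
}

// Called from the stream gateway: enqueue only, never block token delivery.
async function enqueueFinalization(messageId: string, finalText: string): Promise<void> {
  const job: FinalizationJob = { messageId, finalText }
  await queue.enqueue("finalize-generation", job)
}

// Runs in a worker, off the user-facing path. Each step can retry independently
// without the user ever seeing a broken stream.
async function handleFinalization(job: FinalizationJob): Promise<void> {
  await db.message.update({
    where: { id: job.messageId },
    data: { content: job.finalText, status: "completed" }
  })
  await meterUsage(job.messageId)        // billing reconciliation
  await indexForRetrieval(job.finalText) // embeddings for later retrieval
  await recordAnalytics(job.messageId)   // analytics, never user-blocking
}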

Why this matters for user experience

Streaming has one main job: keep the user in the loop while the model is working. It should not be used as a hiding place for unrelated latency.

The metrics I track separately:

Metric                 | Definition                            | Owner
Command latency        | client request to stream URL returned | BFF
TTFT                   | stream opened to first token          | stream gateway
Stream completion rate | streams reaching final event          | gateway + model
State finalization lag | final token to durable save           | worker
Retrieval latency      | context lookup time                   | retrieval service

If all of these are measured as one "API latency" number, the team cannot fix the product.

For example, a slow vector database search might not need to block the first token. The system can stream an initial acknowledgement, retrieve context asynchronously, or use cached summaries. A slow billing write should almost never block generation. A failed analytics job should never break the user's answer.
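
One way to keep retrieval out of the first-token path is to race it against a fixed budget and fall back to a cached summary. A minimal sketch; vectorDB.search, formatContext, and cachedSummary are assumed helpers, and the 300 ms budget is an arbitrary example:

// Give retrieval a strict budget. If it loses the race, generation starts from
// a cached summary instead of holding the stream open with an empty screen.
async function contextWithBudget(conversationId: string, query: string, budgetMs = 300): Promise<string> {
  const timeout = new Promise<null>(resolve => setTimeout(() => resolve(null), budgetMs))
  const retrieval = vectorDB.search(query).catch(() => null) // retrieval failure must not fail the stream

  const hits = await Promise.race([retrieval, timeout])
  if (hits !== null) return formatContext(hits)

  return cachedSummary(conversationId) // stale but instant
}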

The local-first iOS version of the same idea

This architecture is not only for web apps. I use the same separation in iOS on-device AI work:

  1. SwiftUI owns user intent.
  2. A view model validates and creates a task.
  3. A model actor streams local output.
  4. SwiftData persists final state.
  5. Background workers index or summarize later.

import SwiftUI

@MainActor
final class ChatViewModel: ObservableObject {
    @Published var transcript: [TokenChunk] = []

    private let engine: LocalGenerationEngine
    private let store: ConversationStore

    init(engine: LocalGenerationEngine, store: ConversationStore) {
        self.engine = engine
        self.store = store
    }

    func send(_ text: String) {
        Task {
            // Command path: record the pending message before any tokens arrive.
            let messageID = try await store.createPendingMessage(text)

            // Stream path: append tokens to the transcript as they arrive.
            for try await chunk in engine.streamReply(to: text) {
                transcript.append(chunk)
            }

            // Product state: durable finalization runs only after streaming ends.
            try await store.finalize(messageID)
        }
    }
}

In a more rigorous version, finalize does not block UI streaming. The same principle applies: do not make a live generation path responsible for every side effect.

Cancellation is an architectural feature

AI products need cancellation. Users change their mind. They navigate away. They submit a better prompt. They close the app.

If one endpoint owns everything, cancellation becomes dangerous. Did the model stop? Did credits decrement? Did the partial message save? Did retrieval finish? Did the UI receive a final event?

I model stream state explicitly:

type GenerationState =
  | { type: "queued" }
  | { type: "streaming"; startedAt: string }
  | { type: "completed"; tokenCount: number }
  | { type: "cancelled"; reason: "user" | "timeout" }
  | { type: "failed"; code: string }

This state machine is boring. It is also what lets the product recover cleanly.
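
One way to enforce it is to route every update through a single pure transition function. A minimal sketch; the event type and the transition function are illustrative, not taken from a specific system:

type GenerationEvent =
  | { type: "start" }
  | { type: "finish"; tokenCount: number }
  | { type: "cancel"; reason: "user" | "timeout" }
  | { type: "fail"; code: string }

// Illegal moves return null instead of mutating state, so a late "finish"
// arriving after a cancellation cannot resurrect a dead stream.
function transition(state: GenerationState, event: GenerationEvent): GenerationState | null {
  switch (event.type) {
    case "start":
      return state.type === "queued"
        ? { type: "streaming", startedAt: new Date().toISOString() }
        : null
    case "finish":
      return state.type === "streaming"
        ? { type: "completed", tokenCount: event.tokenCount }
        : null
    case "cancel":
      return state.type === "queued" || state.type === "streaming"
        ? { type: "cancelled", reason: event.reason }
        : null
    case "fail":
      return state.type === "queued" || state.type === "streaming"
        ? { type: "failed", code: event.code }
        : null
  }
}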

The failed pattern: saving every token synchronously

Some apps persist every streamed token directly to the database. That sounds safe because the transcript survives refreshes. It can also destroy performance.

Better patterns:

  1. Buffer chunks on the client and send periodic snapshots.
  2. Persist final text after completion.
  3. Persist coarse checkpoints for very long generations.
  4. Store raw provider events in cheaper append-only storage if auditability is required.

The right answer depends on the product. A legal drafting tool needs stronger durability than a playful chat UI. But neither should let database writes randomly pause the token stream.
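
As an illustration of the first pattern, the client can render every token immediately but only ship periodic snapshots to the server. A minimal sketch; the snapshot endpoint and the two-second interval are assumptions:

// Client-side buffer: the UI sees every token, the server only sees snapshots,
// so persistence can never pause the live stream.
class TranscriptBuffer {
  private text = ""
  private timer: ReturnType<typeof setInterval>

  constructor(private messageId: string, flushEveryMs = 2000) {
    this.timer = setInterval(() => this.flush(), flushEveryMs)
  }

  append(token: string): void {
    this.text += token
  }

  private flush(): void {
    if (this.text.length === 0) return
    // Fire-and-forget: a lost snapshot is recovered by the final durable save.
    void fetch(`/api/messages/${this.messageId}/snapshot`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ partialText: this.text })
    })
  }

  finish(): void {
    clearInterval(this.timer)
    this.flush()
  }
}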

The engineering conclusion

An AI wrapper becomes a real product when generation is no longer treated as a single HTTP request.

The minimum serious architecture separates:

  1. Command validation.
  2. Stream delivery.
  3. Retrieval and context assembly.
  4. Durable state writes.
  5. Billing and analytics.
  6. Cancellation and failure recovery.

This design does not make the model smarter. It makes the product less fragile. That is usually the difference between a demo people admire and a tool people keep open.