Privacy Engineering

How I Passed Guideline 5.1.1 With Local Data Flow and Optional Registration

A Guideline 5.1.1 privacy and data-collection case study for local-first document AI.

2026-04-03 · Privacy · AI · Easy PDF Pro · iOS · On-Device AI

The Easy PDF Pro review thread did not question model quality first; it questioned data handling assumptions. The reviewer wanted explicit confirmation that common analysis flows do not silently export document text to remote endpoints.

Forum discussions under Guideline 5.1.1 show the same failure mode across many apps: if reviewers suspect registration is forced for non-account features, or privacy behavior is unclear, the submission is blocked even when core AI output quality is acceptable.

Users do not import generic text into a PDF tool. They import contracts, invoices, lecture notes, medical instructions, research papers, resumes, tax documents, and private drafts. If the app sends every page to a remote model by default, the privacy boundary becomes the product's weakest point.

Easy PDF Pro was designed around a stricter premise: the document analysis path should run locally whenever the requested operation is narrow enough to fit on device.

That sounds simple in marketing language. In engineering terms, it requires a full local pipeline: file access, parsing, chunking, indexing, retrieval, prompt construction, local inference, result storage, deletion, and error recovery. If any one of those pieces silently falls back to a remote service, the privacy story collapses.

Figure 1: The local pipeline: PDF file → parser (PDFKit) → chunks in a local index → retriever (top-k) → local LLM, with SwiftData storing metadata, chunks, and user-controlled history. The privacy boundary is not a slogan; it is this complete path from imported file to generated answer staying inside the local application container.

Privacy Trigger Matrix (Document AI Apps)

Common trigger in document AI | Guideline pressure | How I make it review-safe
Login is required before non-account document parsing features | 5.1.1 | Allow baseline non-account usage before registration
Remote upload path is ambiguous in normal analysis flow | 5.1.1 | Document and enforce local-only default path
No clear deletion and data lifecycle controls | 5.1.1 | Expose explicit delete controls and retention policy in UI
Privacy claims in listing exceed what runtime actually guarantees | 2.3.1 + 5.1.1 | Align privacy claims with observed data path behavior
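
In code, the first row comes down to keeping the registration gate out of the document path entirely. A minimal sketch of the kind of policy I mean, with hypothetical feature names:

enum AppFeature {
    case importDocument
    case askQuestion
    case summarize
    case crossDeviceSync   // example of a feature that genuinely needs an account
}

struct AccessPolicy {
    // Only features that actually depend on an account may ask for one.
    // Importing, asking questions, and summarizing never do.
    func requiresAccount(_ feature: AppFeature) -> Bool {
        switch feature {
        case .importDocument, .askQuestion, .summarize:
            return false
        case .crossDeviceSync:
            return true
        }
    }
}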

The document pipeline

The first engineering decision is to separate file identity from extracted content.

import Foundation
import SwiftData

@Model
final class PDFDocumentRecord {
    @Attribute(.unique) var id: UUID
    var displayName: String
    /// Security-scoped bookmark to the source file when the PDF is not copied into the container.
    var bookmarkData: Data?
    var importedAt: Date
    var pageCount: Int
    /// Tracks whether analysis for this document has stayed on device. True by default.
    var localOnly: Bool

    init(displayName: String, bookmarkData: Data?, pageCount: Int) {
        self.id = UUID()
        self.displayName = displayName
        self.bookmarkData = bookmarkData
        self.importedAt = .now
        self.pageCount = pageCount
        self.localOnly = true
    }
}

The record does not need to store the entire PDF as a blob. Depending on the import model, the app can copy the file into its container or keep a security-scoped bookmark. What matters is that the analysis layer receives a local file handle and never uploads document bytes for the local path.
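
When the app keeps a security-scoped bookmark instead of copying the file, access follows the standard bookmark dance. A minimal sketch, with a hypothetical error type:

import Foundation

enum BookmarkError: Error {
    case missingBookmark
    case accessDenied
}

// Resolve the stored bookmark and open security-scoped access before handing
// the URL to the extraction layer. The caller must balance this with
// stopAccessingSecurityScopedResource() once extraction finishes.
func resolveLocalFile(for record: PDFDocumentRecord) throws -> URL {
    guard let bookmarkData = record.bookmarkData else {
        throw BookmarkError.missingBookmark
    }
    var isStale = false
    let url = try URL(
        resolvingBookmarkData: bookmarkData,
        options: [],
        relativeTo: nil,
        bookmarkDataIsStale: &isStale
    )
    // If isStale is true, a fresh bookmark should be created and stored.
    guard url.startAccessingSecurityScopedResource() else {
        throw BookmarkError.accessDenied
    }
    return url
}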

Extraction uses a worker actor:

import Foundation
import PDFKit

/// One page of extracted text, keyed by its 1-based page number.
struct ExtractedPage {
    let pageNumber: Int
    let text: String
}

enum PDFError: Error {
    case unreadableDocument
}

actor PDFExtractionWorker {
    func extractPages(from url: URL) throws -> [ExtractedPage] {
        guard let document = PDFDocument(url: url) else {
            throw PDFError.unreadableDocument
        }

        return (0..<document.pageCount).compactMap { index in
            guard let page = document.page(at: index) else { return nil }
            return ExtractedPage(
                pageNumber: index + 1,
                text: page.string ?? ""
            )
        }
    }
}

PDF extraction is not glamorous, but it defines the upper bound of answer quality. If the text layer is poor, the model will hallucinate around missing structure.
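
One way to catch a weak text layer early, before any question is asked, is to check how much text actually came out. A rough heuristic sketch; the 20-character and 50% thresholds are arbitrary:

import Foundation

// Flag documents whose extracted text layer is too thin to trust, so the
// UI can suggest OCR or warn instead of letting the model improvise.
func looksLikeScan(_ pages: [ExtractedPage]) -> Bool {
    guard !pages.isEmpty else { return true }
    let thinPages = pages.filter {
        $0.text.trimmingCharacters(in: .whitespacesAndNewlines).count < 20
    }
    return Double(thinPages.count) / Double(pages.count) > 0.5
}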

Chunking is a product decision

For local inference, context is scarce. A naive "put the whole PDF into the prompt" approach fails quickly. Chunking decides what the model can see.

I use page-aware chunks with overlap:

struct TextChunk: Codable, Hashable {
    let id: UUID
    let documentID: UUID
    let pageRange: ClosedRange<Int>
    let text: String
    let tokenEstimate: Int
}

The chunker balances three constraints:

  1. The chunk must be small enough for local context.
  2. It must preserve citations back to page numbers.
  3. It must not split tables, headings, or lists too aggressively.

For a small model, I prefer shorter chunks and stricter answer formats. The model should answer from retrieved passages, not improvise a document-wide summary from memory.
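
A minimal page-aware chunker along those lines; the four-characters-per-token estimate and the split on blank lines are crude stand-ins for a real tokenizer and layout analysis:

import Foundation

// Sketch of a page-aware chunker with overlap. Paragraph boundaries are
// approximated by blank lines so headings, lists, and tables are less
// likely to be cut mid-structure.
func makeChunks(
    from pages: [ExtractedPage],
    documentID: UUID,
    maxTokens: Int = 350,
    overlapTokens: Int = 40
) -> [TextChunk] {
    var chunks: [TextChunk] = []
    var buffer: [(page: Int, text: String)] = []
    var bufferTokens = 0
    var appendedSinceFlush = false

    func flush() {
        guard appendedSinceFlush,
              let first = buffer.first?.page,
              let last = buffer.last?.page else { return }
        chunks.append(TextChunk(
            id: UUID(),
            documentID: documentID,
            pageRange: first...last,
            text: buffer.map(\.text).joined(separator: "\n\n"),
            tokenEstimate: bufferTokens
        ))
        appendedSinceFlush = false
        // Keep the tail of the buffer as overlap for the next chunk.
        while bufferTokens > overlapTokens, buffer.count > 1 {
            bufferTokens -= max(buffer.removeFirst().text.count / 4, 1)
        }
    }

    for page in pages {
        for paragraph in page.text.components(separatedBy: "\n\n")
        where !paragraph.trimmingCharacters(in: .whitespaces).isEmpty {
            let tokens = max(paragraph.count / 4, 1)
            if bufferTokens + tokens > maxTokens, !buffer.isEmpty {
                flush()
            }
            buffer.append((page.pageNumber, paragraph))
            bufferTokens += tokens
            appendedSinceFlush = true
        }
    }
    flush()
    return chunks
}

Retrieval then runs locally against those chunks: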

struct RetrievalResult {
    let chunk: TextChunk
    let score: Double
}

// The index can be any local search component. It is sketched here as a
// protocol so the retriever compiles without committing to one strategy.
protocol LocalChunkIndex {
    func search(query: String, documentID: UUID, limit: Int) async throws -> [RetrievalResult]
}

actor LocalRetriever {
    private let index: LocalChunkIndex

    init(index: LocalChunkIndex) {
        self.index = index
    }

    func topChunks(for query: String, documentID: UUID, limit: Int) async throws -> [RetrievalResult] {
        // A real implementation can use embeddings, BM25, or a hybrid strategy.
        // The key is that retrieval happens locally against local chunks.
        try await index.search(query: query, documentID: documentID, limit: limit)
    }
}

The local answer contract

The prompt contract is intentionally conservative:

<task>
answer_question_from_pdf_context
</task>

<rules>
- Use only the provided context.
- If the answer is not present, say that the document section was not found.
- Include page citations.
- Return JSON only.
</rules>

<context>
...
</context>

<question>
...
</question>

<json>
{"answer":

The output type is explicit:

struct PDFAnswer: Decodable {
    let answer: String
    let citedPages: [Int]
    let confidence: Confidence
    let missingContext: Bool
}

enum Confidence: String, Decodable {
    case low, medium, high
}

This is the difference between a document assistant and a chatbot with a PDF icon. A document assistant must know when the answer is not in the retrieved text.
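
On the client side, that means the decode step treats missingContext as a first-class outcome. A minimal sketch, assuming the raw model output arrives as a JSON string:

import Foundation

// Decode the model output and surface "not found" explicitly instead of
// falling back to whatever text the model produced.
enum AnswerOutcome {
    case answered(PDFAnswer)
    case notFoundInDocument
    case malformedOutput
}

func parseAnswer(from rawOutput: String) -> AnswerOutcome {
    guard let answer = try? JSONDecoder().decode(PDFAnswer.self, from: Data(rawOutput.utf8)) else {
        return .malformedOutput
    }
    return answer.missingContext ? .notFoundInDocument : .answered(answer)
}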

Privacy engineering means deletion engineering

A local-first app still needs a deletion model. "We do not upload files" is not enough if the app leaves extracted chunks, summaries, and chat history scattered across local storage.

I model derived data explicitly:

Data | Location | User control
Imported PDF | app container or security-scoped source | remove document
Extracted text chunks | SwiftData / local index | deleted with document
Summaries | SwiftData | clear per document
Chat history | SwiftData | clear per document or all
Diagnostics | local logs or system crash reports | no document content

Deletion should be implemented as a graph operation:

import Foundation
import SwiftData

actor DocumentDeletionWorker {
    let context: ModelContext

    init(context: ModelContext) {
        self.context = context
    }

    /// Deletes the document and every derived artifact in one pass, so
    /// nothing generated from the file outlives the file itself.
    func deleteDocument(id: UUID) throws {
        try deleteChunks(documentID: id)
        try deleteAnswers(documentID: id)
        try deleteSummaries(documentID: id)
        try deleteFileIfManaged(documentID: id)
        try deleteRecord(documentID: id)
        try context.save()
    }
}
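
The per-type helpers are boring fetch-and-delete loops. One of them, sketched against a hypothetical TextChunkRecord model (the Codable TextChunk above is the in-memory shape, not necessarily the persisted one):

import Foundation
import SwiftData

// Hypothetical persisted form of a chunk; field names are illustrative.
@Model
final class TextChunkRecord {
    var documentID: UUID
    var pageStart: Int
    var pageEnd: Int
    var text: String

    init(documentID: UUID, pageStart: Int, pageEnd: Int, text: String) {
        self.documentID = documentID
        self.pageStart = pageStart
        self.pageEnd = pageEnd
        self.text = text
    }
}

extension DocumentDeletionWorker {
    // Fetch every chunk derived from this document and delete it.
    func deleteChunks(documentID: UUID) throws {
        let descriptor = FetchDescriptor<TextChunkRecord>(
            predicate: #Predicate { $0.documentID == documentID }
        )
        for chunk in try context.fetch(descriptor) {
            context.delete(chunk)
        }
    }
}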

The hard part is not the code. The hard part is remembering that generated artifacts are also user data.

Performance budget

A local PDF assistant has several latency points:

Stage | Target for good UX | Notes
File import | under 1s for small PDFs | large scans are different
Text extraction | under 100ms/page when possible | depends on PDF structure
Chunk indexing | background after import | should not block reading
Question retrieval | under 150ms | local index
Local answer generation | 300ms-2s | depends on model and device

I do not try to make every stage instant. I try to put each stage in the right part of the user journey. Extraction can happen after import. Indexing can continue while the user reads. Generation should feel responsive once the user asks.
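
A minimal sketch of that staging, reusing the makeChunks sketch above; the store closure stands in for whatever local index the app uses:

import Foundation

// Import returns to the UI immediately; extraction and chunking continue
// as a low-priority background task so reading is never blocked.
func scheduleIndexing(
    of documentID: UUID,
    at url: URL,
    using worker: PDFExtractionWorker,
    store: @escaping @Sendable ([TextChunk]) async throws -> Void
) {
    Task(priority: .utility) {
        do {
            let pages = try await worker.extractPages(from: url)
            let chunks = makeChunks(from: pages, documentID: documentID)
            try await store(chunks)   // hand the chunks to the local index
        } catch {
            // A failed index run degrades to "reading only"; nothing leaves the device.
        }
    }
}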

The architecture conclusion

Private document AI is not achieved by adding the word "private" to onboarding. It is achieved by making the local path complete.

The engineering checklist is:

  1. Local file access with clear ownership.
  2. Local text extraction and chunking.
  3. Local retrieval with citations.
  4. A constrained local generation contract.
  5. Explicit storage for derived artifacts.
  6. Deletion that covers documents, chunks, summaries, and history.
  7. No silent remote fallback for document content.

This is why on-device AI matters for PDF workflows. The app can become useful without asking the user to donate private documents to a server.