How I Passed Guideline 5.1.1 With Local Data Flow and Optional Registration
A Guideline 5.1.1 privacy and data-collection case study for local-first document AI.
The Easy PDF Pro review thread did not question model quality first; it questioned data handling assumptions. The reviewer wanted explicit confirmation that common analysis flows do not silently export document text to remote endpoints.
Forum discussions under Guideline 5.1.1 show the same failure mode across many apps: if reviewers suspect registration is forced for non-account features, or privacy behavior is unclear, the submission is blocked even when core AI output quality is acceptable.
Users do not import generic text into a PDF tool. They import contracts, invoices, lecture notes, medical instructions, research papers, resumes, tax documents, and private drafts. If the app sends every page to a remote model by default, the privacy boundary becomes the product's weakest point.
Easy PDF Pro was designed around a stricter premise: the document analysis path should run locally whenever the requested operation is narrow enough to fit on device.
That sounds simple in marketing language. In engineering terms, it requires a full local pipeline: file access, parsing, chunking, indexing, retrieval, prompt construction, local inference, result storage, deletion, and error recovery. If any one of those pieces silently falls back to a remote service, the privacy story collapses.
Privacy Trigger Matrix (Document AI Apps)
| Common trigger in document AI | Guideline pressure | How I make it review-safe |
|---|---|---|
| Login is required before non-account document parsing features | 5.1.1 | Allow baseline non-account usage before registration |
| Remote upload path is ambiguous in normal analysis flow | 5.1.1 | Document and enforce local-only default path |
| No clear deletion and data lifecycle controls | 5.1.1 | Expose explicit delete controls and retention policy in UI |
| Privacy claims in listing exceed what runtime actually guarantees | 2.3.1 + 5.1.1 | Align privacy claims with observed data path behavior |
The document pipeline
The first engineering decision is to separate file identity from extracted content.
```swift
import Foundation
import SwiftData

@Model
final class PDFDocumentRecord {
    @Attribute(.unique) var id: UUID
    var displayName: String
    var bookmarkData: Data?
    var importedAt: Date
    var pageCount: Int
    var localOnly: Bool

    init(displayName: String, bookmarkData: Data?, pageCount: Int) {
        self.id = UUID()
        self.displayName = displayName
        self.bookmarkData = bookmarkData
        self.importedAt = .now
        self.pageCount = pageCount
        self.localOnly = true
    }
}
```
The record does not need to store the entire PDF as a blob. Depending on the import model, the app can copy the file into its container or keep a security-scoped bookmark. What matters is that the analysis layer receives a local file handle and never uploads document bytes for the local path.
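When the bookmark route is used, access must be wrapped in a security scope before extraction can read the file. A minimal sketch of that resolution step (the function name is illustrative, not from the app's real code):

```swift
import Foundation

// Resolve a stored security-scoped bookmark back to a usable local URL.
func resolveLocalURL(from bookmarkData: Data) throws -> URL {
    var isStale = false
    let url = try URL(
        resolvingBookmarkData: bookmarkData,
        bookmarkDataIsStale: &isStale
    )
    // A stale bookmark should be re-created from the resolved URL the
    // next time the user grants access.
    guard url.startAccessingSecurityScopedResource() else {
        throw CocoaError(.fileReadNoPermission)
    }
    // The caller must balance this with
    // url.stopAccessingSecurityScopedResource() once extraction finishes.
    return url
}
```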
Extraction uses a worker actor:
```swift
import PDFKit

struct ExtractedPage {
    let pageNumber: Int
    let text: String
}

enum PDFError: Error {
    case unreadableDocument
}

actor PDFExtractionWorker {
    func extractPages(from url: URL) throws -> [ExtractedPage] {
        guard let document = PDFDocument(url: url) else {
            throw PDFError.unreadableDocument
        }
        return (0..<document.pageCount).compactMap { index in
            guard let page = document.page(at: index) else { return nil }
            return ExtractedPage(
                pageNumber: index + 1,
                text: page.string ?? ""
            )
        }
    }
}
```
PDF extraction is not glamorous, but it defines the upper bound of answer quality. If the text layer is poor, the model will hallucinate around missing structure.
Chunking is a product decision
For local inference, context is scarce. A naive "put the whole PDF into the prompt" approach fails quickly. Chunking decides what the model can see.
I use page-aware chunks with overlap:
```swift
struct TextChunk: Codable, Hashable {
    let id: UUID
    let documentID: UUID
    let pageRange: ClosedRange<Int>
    let text: String
    let tokenEstimate: Int
}
```
The chunker balances three constraints:
- The chunk must be small enough for local context.
- It must preserve citations back to page numbers.
- It must not split tables, headings, or lists too aggressively.
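A chunker honoring those constraints can be sketched as follows. It assumes the `TextChunk` and `ExtractedPage` types shown earlier, splits only at page boundaries, and uses a characters-divided-by-four token estimate as a stand-in for the local model's real tokenizer:

```swift
import Foundation

func estimateTokens(_ text: String) -> Int {
    max(1, text.count / 4)
}

func chunkPages(_ pages: [ExtractedPage], documentID: UUID,
                maxTokens: Int = 400, overlapCharacters: Int = 200) -> [TextChunk] {
    var chunks: [TextChunk] = []
    var buffer = ""
    var startPage = pages.first?.pageNumber ?? 1
    var endPage = startPage
    var hasNewContent = false

    func flush() {
        let trimmed = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        guard hasNewContent, !trimmed.isEmpty else { return }
        chunks.append(TextChunk(id: UUID(), documentID: documentID,
                                pageRange: startPage...endPage,
                                text: trimmed,
                                tokenEstimate: estimateTokens(trimmed)))
        // Carry a short tail into the next chunk so answers that straddle
        // a page boundary still retrieve enough context.
        buffer = String(trimmed.suffix(overlapCharacters))
        startPage = endPage
        hasNewContent = false
    }

    for page in pages {
        endPage = page.pageNumber
        buffer += "\n" + page.text
        hasNewContent = true
        if estimateTokens(buffer) >= maxTokens { flush() }
    }
    flush()
    return chunks
}
```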
For a small model, I prefer shorter chunks and stricter answer formats. The model should answer from retrieved passages, not improvise a document-wide summary from memory.
```swift
struct RetrievalResult {
    let chunk: TextChunk
    let score: Double
}

actor LocalRetriever {
    // `index` stands in for whatever on-device search index the app ships.
    let index: LocalSearchIndex

    func topChunks(for query: String, documentID: UUID, limit: Int) async throws -> [RetrievalResult] {
        // A real implementation can use embeddings, BM25, or a hybrid strategy.
        // The key is that retrieval happens locally against local chunks.
        try await index.search(query: query, documentID: documentID, limit: limit)
    }
}
```
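For illustration, the simplest purely local scoring strategy is normalized term overlap between the query and a chunk. This is a placeholder for BM25 or embeddings, not the shipped ranking:

```swift
import Foundation

// Score a chunk by the fraction of query terms it contains.
func overlapScore(query: String, chunkText: String) -> Double {
    func terms(_ s: String) -> Set<String> {
        Set(s.lowercased()
             .components(separatedBy: CharacterSet.alphanumerics.inverted)
             .filter { $0.count > 2 })
    }
    let queryTerms = terms(query)
    let chunkTerms = terms(chunkText)
    guard !queryTerms.isEmpty else { return 0 }
    return Double(queryTerms.intersection(chunkTerms).count) / Double(queryTerms.count)
}
```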
The local answer contract
The prompt contract is intentionally conservative:
```
<task>
answer_question_from_pdf_context
</task>
<rules>
- Use only the provided context.
- If the answer is not present, say that the document section was not found.
- Include page citations.
- Return JSON only.
</rules>
<context>
...
</context>
<question>
...
</question>
<json>
{"answer":
```
The output type is explicit:
```swift
struct PDFAnswer: Decodable {
    let answer: String
    let citedPages: [Int]
    let confidence: Confidence
    let missingContext: Bool
}

enum Confidence: String, Decodable {
    case low, medium, high
}
```
This is the difference between a document assistant and a chatbot with a PDF icon. A document assistant must know when the answer is not in the retrieved text.
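Decoding can be sketched like this, assuming the prefilled `{"answer":` fragment has already been prepended to the model's continuation. Small local models sometimes emit trailing text after the JSON object, so this sketch trims to the outermost braces before decoding (an assumption, not the app's exact code):

```swift
import Foundation

// Extract and decode the JSON object from raw model output.
// Returns nil when no well-formed PDFAnswer can be recovered, which the
// UI should treat the same as missingContext == true.
func parseAnswer(from raw: String) -> PDFAnswer? {
    guard let start = raw.firstIndex(of: "{"),
          let end = raw.lastIndex(of: "}"),
          start < end else { return nil }
    let json = String(raw[start...end])
    return try? JSONDecoder().decode(PDFAnswer.self, from: Data(json.utf8))
}
```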
Privacy engineering means deletion engineering
A local-first app still needs a deletion model. "We do not upload files" is not enough if the app leaves extracted chunks, summaries, and chat history scattered across local storage.
I model derived data explicitly:
| Data | Location | User control |
|---|---|---|
| Imported PDF | app container or security-scoped source | remove document |
| Extracted text chunks | SwiftData / local index | deleted with document |
| Summaries | SwiftData | clear per document |
| Chat history | SwiftData | clear per document or all |
| Diagnostics | local logs or system crash reports | no document content |
Deletion should be implemented as a graph operation:
```swift
actor DocumentDeletionWorker {
    let context: ModelContext

    func deleteDocument(id: UUID) throws {
        try deleteChunks(documentID: id)
        try deleteAnswers(documentID: id)
        try deleteSummaries(documentID: id)
        try deleteFileIfManaged(documentID: id)
        try deleteRecord(documentID: id)
        try context.save()
    }
}
```
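One edge of that graph can be sketched with SwiftData. `ChunkRecord` here is a hypothetical model type keyed by `documentID`, standing in for however the app persists extracted chunks:

```swift
import Foundation
import SwiftData

// Hypothetical SwiftData model for persisted text chunks.
@Model
final class ChunkRecord {
    var documentID: UUID
    var text: String

    init(documentID: UUID, text: String) {
        self.documentID = documentID
        self.text = text
    }
}

extension DocumentDeletionWorker {
    func deleteChunks(documentID: UUID) throws {
        // Fetch every chunk belonging to this document and delete it,
        // so no derived text outlives the document itself.
        let descriptor = FetchDescriptor<ChunkRecord>(
            predicate: #Predicate { $0.documentID == documentID }
        )
        for record in try context.fetch(descriptor) {
            context.delete(record)
        }
    }
}
```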
The hard part is not the code. The hard part is remembering that generated artifacts are also user data.
Performance budget
A local PDF assistant has several latency points:
| Stage | Target for good UX | Notes |
|---|---|---|
| File import | under 1s for small PDFs | large scans are different |
| Text extraction | under 100ms/page when possible | depends on PDF structure |
| Chunk indexing | background after import | should not block reading |
| Question retrieval | under 150ms | local index |
| Local answer generation | 300ms-2s | depends on model and device |
I do not try to make every stage instant. I try to put each stage in the right part of the user journey. Extraction can happen after import. Indexing can continue while the user reads. Generation should feel responsive once the user asks.
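That staging can be sketched as a single background task kicked off at import time. `extractionWorker` and `indexer` are illustrative names, not the app's exact API:

```swift
import Foundation

// Run extraction, chunking, and indexing at background priority right
// after import so the reading UI stays responsive.
func handleImport(of url: URL, documentID: UUID) {
    Task(priority: .background) {
        do {
            let pages = try await extractionWorker.extractPages(from: url)
            let chunks = chunkPages(pages, documentID: documentID)
            try await indexer.add(chunks)
        } catch {
            // Surface extraction failures in the UI rather than silently
            // falling back to any remote path.
        }
    }
}
```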
The architecture conclusion
Private document AI is not achieved by adding the word "private" to onboarding. It is achieved by making the local path complete.
The engineering checklist is:
- Local file access with clear ownership.
- Local text extraction and chunking.
- Local retrieval with citations.
- A constrained local generation contract.
- Explicit storage for derived artifacts.
- Deletion that covers documents, chunks, summaries, and history.
- No silent remote fallback for document content.
This is why on-device AI matters for PDF workflows. The app can become useful without asking the user to donate private documents to a server.