OCR Landscape 2026: Recap for Lumina-Wiki Phase C

Local Doc Ingestion brainstorm · 2026-05-07 · Target audience: general (non-technical) users · Constraints: zero-bundle, opt-in, no native deps

Decision: Mistral OCR 3 API is the default for the opt-in tier. The free tier uses the agent host's Vision-passthrough (Phase B). Tesseract, EasyOCR, and Docling are dropped from the roadmap. Self-hosted olmOCR-2 is documented only as a reference for power users; it is not bundled.

1. Overall comparison table: 12 OCR options in 2026

Legend: ★★★★★ good fit for Lumina · ★★★ trade-off · ★ poor fit · CHOSEN = decided

Option	Type	Quality (scanned)	Price / 1K pages	Install friction	Markdown output	Multilingual	Privacy	Lumina verdict
Vision-passthrough (Claude/Gemini/GPT host)	Host's VLM	90–94%	$0 (host tokens)	Zero	Native	Excellent	Depends on the user's host	PHASE B
Mistral OCR 3 API	Cloud VLM-OCR	SOTA (96.6% tables, 88.9% handwriting)	$1 (batch)	API key	Native	Excellent	Cloud (EU option)	PHASE C DEFAULT
Google Vision	Cloud OCR	Top on printed media	$1.50 (1K free/mo)	API key + GCP project	No	Good	Cloud	★★★ zero-cost tier option
Azure Document Intelligence	Cloud OCR	Solid baseline	$1.50 – $15	API key + Azure	Yes	Good	Cloud	★★ loses to Mistral on every axis
AWS Textract	Cloud OCR	Good on tables/forms	$1.50 – $65	AWS account + IAM	JSON, no Markdown	Good	Cloud	★ up to 65× Mistral's price
olmOCR-2 (Allen AI)	Self-host VLM	SOTA (82.4 olmOCR-Bench)	$0 + GPU electricity	GPU + 15GB model	Native	Excellent	Local	★★ document, do not bundle
DeepSeek-OCR (3B MoE, 570M active)	Self-host VLM + context compression	Beats MinerU2.0 with fewer tokens (OmniDocBench)	$0 + GPU electricity	A100 GPU, model weights on HF	Yes	Good	Local	★★ document; novelty: optical context compression
PaddleOCR-VL 7B	Self-host VLM	Top OmniDocBench (92.86)	$0.09 self-host	GPU + Paddle stack	Yes	Excellent (CN strong)	Local	★★ document only
Docling (IBM)	OSS pipeline	Good on digital-born, weak on scanned	$0	Python + 150MB Granite model	Native	Average	Local	★ weak on scanned, heavy install
Surya / Marker (datalab)	OSS pipeline	Good layout, 90+ languages	$0	Python ML stack	Yes	Excellent	Local	★ dev tool, not for non-technical users
EasyOCR / PaddleOCR (non-VL)	Classic OCR + ML	~75–85%	$0	40MB models auto-pulled	No	Good	Local	★ clearly behind VLM-OCR
Tesseract (pytesseract)	Traditional OCR	60–75%	$0	Manually installed binary (brew/apt/choco)	No	Needs tessdata	Local	★ rejected by owner

2. Price comparison: real-world cost from 1K to 100K pages/month

Service	1K pages	10K pages	100K pages	50K invoices/month (advanced)	Note
Mistral OCR 3 (batch)	$1	$10	$100	$50	Markdown native, 96.6% tables
Google Vision	Free (tier)	$13.50	$148.50	~$200	1K free/feature/mo, permanently
Azure DI	$1.50	$15	$150	~$300	Layout add-on costs extra
AWS Textract advanced	$65	$650	$6,500	$3,250	65× Mistral's price (Mistral is about 98% cheaper)
olmOCR self-host (GPU $0.5/h)	~$2 (electricity)	~$20	~$200	~$100	Plus setup time + GPU hardware
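
The Mistral, Google Vision, and Textract figures in the table follow from simple per-1K-page arithmetic, which the sketch below reproduces. It assumes flat list prices taken from the table and, for Google Vision, a single feature with the first 1K pages free each month. The function names are hypothetical and not part of Lumina.

```python
def mistral_batch_cost(pages: int, price_per_1k: float = 1.00) -> float:
    """Linear pricing: $1 per 1K pages in batch mode (figure from the table)."""
    return pages / 1000 * price_per_1k


def google_vision_cost(pages: int, price_per_1k: float = 1.50,
                       free_pages: int = 1000) -> float:
    """Assumes one feature; the first 1K pages each month are not billed."""
    billable = max(0, pages - free_pages)
    return billable / 1000 * price_per_1k


def textract_advanced_cost(pages: int, price_per_1k: float = 65.00) -> float:
    """Linear pricing at the table's top-end Textract rate of $65 per 1K pages."""
    return pages / 1000 * price_per_1k


# Print the three cloud services side by side for the table's volumes.
for pages in (1_000, 10_000, 100_000):
    print(pages,
          mistral_batch_cost(pages),
          google_vision_cost(pages),
          textract_advanced_cost(pages))
```

The estimated rows (olmOCR self-host, Azure DI, and the 50K-invoice column for Google Vision) depend on inputs the table does not state, so they are left out of the sketch.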

3. Decision matrix: `/lumi-ingest` runtime

Input: image / scanned PDF
│
├── docs-ocr pack not installed?
│     └─→ Vision-passthrough (Phase B)                  [DEFAULT, FREE]
│
├── docs-ocr pack installed + MISTRAL_API_KEY in .env?
│     └─→ Mistral OCR 3 API                             [PHASE C DEFAULT]
│
└── docs-ocr pack installed + LUMINA_OCR_ENDPOINT in .env?
      └─→ Self-host endpoint (olmOCR / PaddleOCR-VL)    [POWER USER]

Three clear tiers. The free path is always available. Opt in to upgrade when you need volume, structured output, or EU-cloud privacy.
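
The three-tier routing can be sketched as a small pure function. This is an illustration under stated assumptions, not Lumina's implementation: `OcrTier` and `select_ocr_tier` are hypothetical names, the branches are checked in the order listed in the tree (so Mistral wins if both variables are set), and an installed pack with neither variable set falls back to the free path, in line with "the free path is always available".

```python
from enum import Enum
from typing import Mapping


class OcrTier(Enum):
    VISION_PASSTHROUGH = "vision-passthrough"  # Phase B, free
    MISTRAL_OCR = "mistral-ocr-3"              # Phase C default
    SELF_HOST = "self-host-endpoint"           # power user


def select_ocr_tier(pack_installed: bool, env: Mapping[str, str]) -> OcrTier:
    """Pick an OCR tier from the docs-ocr pack state and .env values."""
    if not pack_installed:
        return OcrTier.VISION_PASSTHROUGH
    # Assumption: branches are checked in the order the tree lists them.
    if env.get("MISTRAL_API_KEY", "").strip():
        return OcrTier.MISTRAL_OCR
    if env.get("LUMINA_OCR_ENDPOINT", "").strip():
        return OcrTier.SELF_HOST
    # Assumption: pack installed but unconfigured still gets the free path.
    return OcrTier.VISION_PASSTHROUGH
```

For example, `select_ocr_tier(True, {"LUMINA_OCR_ENDPOINT": "http://localhost:8000"})` selects the self-host tier; the URL is a placeholder. The spec still needs to settle which variable takes precedence when both are present.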

4. Why Mistral OCR 3: six blunt reasons

1. Lowest per-page price among the cloud OCR APIs compared

$1/1K pages in batch mode. AWS Textract advanced is $65, a 65× gap. At 50K invoices/month: $50 vs $3,250.

2. Top-tier quality

96.6% on tables (vs Textract 84.8%), 88.9% on handwriting (vs Azure 78.2%). A double-digit lead on both.

3. Markdown native

The output is directly usable in the wiki, with no JSON parsing step as with Textract / Azure.

4. Strong multilingual support

VI / EN / ZH work out of the box. No tessdata, no model download.

5. The pattern already exists

.env + MISTRAL_API_KEY, same as the research pack. /lumi-ocr-setup validates the key in under 5s.

6. Free tier for trying it out

Users can test before paying. No commitment required.
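
As an illustration of the `.env` + `MISTRAL_API_KEY` pattern from reason 5, here is a minimal sketch of the local half of a key check: parse the `.env` text and confirm the variable is present and non-empty. `parse_env_text` and `has_mistral_key` are hypothetical helpers, and the parser is deliberately simplified (no `export` prefix, no inline comments, no variable expansion). Validating the key against Mistral itself would need a network call, which is not shown.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def has_mistral_key(env: dict[str, str]) -> bool:
    """True when MISTRAL_API_KEY is set to a non-blank value."""
    return bool(env.get("MISTRAL_API_KEY", "").strip())
```

A setup command could run this check first and fail fast with a clear message before spending any time on a remote validation request.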

5. Why the other options were ruled out

Option	Reason for exclusion
Tesseract	Already rejected by the owner. Quality is 60–75%, the manually installed binary is a barrier for a general audience, and the output is unstructured plain text.
EasyOCR / PaddleOCR (non-VL)	Auto-pulls a 40MB model, which is opaque to the user. Quality trails 2026-generation VLM-OCR.
Docling	Markdown-first but weak on scanned/handwritten input, which is exactly the OCR pack's use case.
Surya / Marker / Nougat / MinerU	Dev tools. They need a Python ML stack, with a GPU recommended. Non-technical users cannot install them.
AWS Textract	1.5–65× Mistral's price with no quality advantage. IAM setup is complex.
Azure DI	More expensive, lower quality than Mistral, and an Azure account is cumbersome for non-technical users.
Google Vision	The free tier of 1K/feature/month is attractive. It can be documented in the spec as a "free alternative", but not as the default.
olmOCR / PaddleOCR-VL self-host	SOTA quality but needs a GPU + 5–15GB model. Document for power users, do not bundle.
DeepSeek-OCR	A new approach (context optical compression: rendering text as an image to compress the LLM's KV cache). A strong OCR model, but it needs an A100 and has no hosted API yet. Document as a reference for power users, do not bundle. The "context compression" direction is out of scope for Lumina.

6. Unresolved questions for the Phase C spec

Mistral data residency for EU users? (Important for Lumina's privacy claim.)
Does Phase B Vision-passthrough hit per-turn image rate limits in Claude Code / Gemini CLI?
Is a Google Vision free-tier fallback (1K/feature/month) needed for zero-cost users without a Mistral key?
Benchmarking Vietnamese documents on Mistral OCR: is a pre-test needed before the decision is finalized?
Mistral's data-retention terms for the API tier (does it train on inputs)?

7. References

Mistral OCR (official)

Benchmark & review

Self-host alternatives

olmOCR (Allen AI)
DeepSeek-OCR GitHub
DeepSeek-OCR paper (arXiv 2510.18234) — context optical compression
BentoML: DeepSeek-OCR explained
Surya GitHub
Modal: 8 Top Open-Source OCR
Unstract: Best Open Source OCR 2026

Recap source: plans/260507-local-doc-ingestion-brainstorm/research/ocr-landscape-2026.md · Self-contained HTML, no external assets · Generated 2026-05-07