Research: OCR/ICR/LVLM for digitization of text

Description

Simplest steps forward to create a searchable ʻŌlelo Noʻeau database.

  • collect the text: gather all the ʻŌlelo Noʻeau (have the book; would be cool to include the audio recordings found on Ulukau)
  • text recognition: OCR converts the scanned characters into an editable format such as .txt
  • manual correction: this is important for accuracy
  • tagging and metadata: after the text is digitized, add metadata such as place, meaning, themes, and kaona
  • database organization: store the digitized and tagged proverbs in a well-structured database (e.g., MySQL, PostgreSQL) with fields for the proverb text, its meaning, metadata, and related themes
  • searchable system: implement full-text search or indexing methods to enable efficient querying
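The last two steps can be sketched in a few lines. This is a minimal illustration using SQLite's FTS5 extension as a lightweight stand-in for the MySQL/PostgreSQL setup mentioned above; the column names (text, meaning, place, themes) and the sample proverb are illustrative, not a fixed schema.

```python
import sqlite3

# In-memory database with a full-text-indexed virtual table.
# FTS5 ships with most standard Python sqlite3 builds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE olelo USING fts5(text, meaning, place, themes)"
)

# One well-known ʻōlelo noʻeau, used purely as sample data.
conn.execute(
    "INSERT INTO olelo VALUES (?, ?, ?, ?)",
    (
        "I ka ʻōlelo no ke ola, i ka ʻōlelo no ka make",
        "In language there is life, in language there is death",
        "",
        "language, power of words",
    ),
)

# MATCH runs a full-text query across all indexed columns,
# so a search on an English keyword from the meaning still
# returns the Hawaiian proverb text.
rows = conn.execute(
    "SELECT text FROM olelo WHERE olelo MATCH ?", ("language",)
).fetchall()
print(rows[0][0])
```

The same pattern scales to PostgreSQL with `tsvector` columns and a GIN index; the point is that the "searchable system" step is mostly a schema decision once the text is clean.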

Findings and Reflections

What did you learn?

Decoding OCR: A Comprehensive Guide
Explore comprehensive OCR technology: from key metrics and preprocessing techniques to advanced models like Surya-OCR.

Great breakdown of what's going on when you're using OCR. I didn't go into all the other technologies, but it's fascinating, and it gets me thinking about how our own brains break down meaning from handwritten language and, in some instances, attach memories to it.

  • ocr (optical character recognition) - example: tesseract ocr API (open-source, supports multiple languages)
  • icr (intelligent character recognition - handwriting recognition) - example: aws textract (supports handwriting recognition) and Omni AI.
  • lvlm / dnn / dtr (large vision-language models + deep neural networks + deep text recognition) - example: gpt-4v (vision-enabled text processing)
  • speech-to-text (for audiobooks & narration) - example: whisper (multilingual, high accuracy)
  • scanning hardware & mobile apps - example: adobe scan (mobile scanning with ocr)
  • edge ai & embedded ocr - example: opencv + tesseract (for real-time ocr on edge devices)
  • crowdsourced & hybrid approaches - example: recaptcha (google's crowdsourced digitization)
  • blockchain & decentralized digital archives - example: ipfs (decentralized file storage for digital books)

Combining approaches gets us to quicker self-correction, improved formatting, better language handling, and less post-processing by hand. I think Omni AI got me from point A to B the quickest, with the least friction.
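One concrete example of the post-processing that Hawaiian text needs: OCR engines often emit the ʻokina as a straight apostrophe (') or backtick (`) instead of the correct glottal-stop character (ʻ, U+02BB). A minimal normalization pass might look like the sketch below; the regex rules are illustrative, not a complete corrector, and real correction would still need the manual-review step above.

```python
import re

OKINA = "\u02bb"  # ʻ (modifier letter turned comma)

def normalize_okina(text: str) -> str:
    """Replace apostrophe-like characters that OCR emits in place
    of the ʻokina: between two letters (Hawai'i -> Hawaiʻi) and
    word-initially before a vowel ('olelo -> ʻolelo)."""
    # Between letters
    text = re.sub(r"(?<=\w)['`\u2019](?=\w)", OKINA, text)
    # Word-initial, immediately before a vowel
    text = re.sub(r"(?<!\w)['`\u2019](?=[aeiouAEIOU])", OKINA, text)
    return text

print(normalize_okina("I ka 'olelo no ke ola"))
```

A fuller pass would also restore kahakō (macrons on ā, ē, ī, ō, ū), which usually requires a wordlist or language model rather than character rules alone.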

formatted word document
formatted markdown document

Other Thoughts

  • Curious how libraries / archives go about doing this work now. It seems really disjointed—research says this, archive bought this outdated software, academia is ahead of museums etc.
  • In order to make this ironclad, I think making it an open-source project, like Wikipedia, would make sense. You could invite people with the expertise to contribute to a system everyone benefits from.
  • More Hawaiian 'ike needs to be digitized. It's like the printing press—we need to get this knowledge online, archived, and stored somewhere safe. If they burn down all the libraries / servers, we need to have backups.
  • Allegedly, US, UK, Germany, China, and France are leaders in putting resources towards digitization of IP and national archives.
