Research: OCR/ICR/LVLM for digitization of text

Description

Simplest steps forward to create a searchable ʻŌlelo Noʻeau database.

  • collect the text: gather all the ʻŌlelo Noʻeau (have the book; would be cool to include the audio recordings found on Ulukau)
  • text recognition: OCR converts the scanned characters into an editable format such as .txt
  • manual correction: this is important for accuracy
  • tagging and metadata: after the text is digitized, add metadata such as place, meaning, themes, and kaona
  • database organization: store the digitized and tagged proverbs in a well-structured database (e.g., MySQL, PostgreSQL) with fields for the proverb text, its meaning, metadata, and related themes
  • searchable system: implement full-text search or indexing methods to enable efficient querying
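The last two steps can be sketched in a few lines. This is a minimal illustration using SQLite's FTS5 extension as a lightweight stand-in for the MySQL/PostgreSQL setup mentioned above; the column names (text, meaning, place, themes) and the sample proverb are illustrative, not a fixed schema.

```python
import sqlite3

# In-memory database with a full-text-indexed virtual table.
# FTS5 ships with most standard Python sqlite3 builds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE olelo USING fts5(text, meaning, place, themes)"
)

# One well-known ʻōlelo noʻeau, used purely as sample data.
conn.execute(
    "INSERT INTO olelo VALUES (?, ?, ?, ?)",
    (
        "I ka ʻōlelo no ke ola, i ka ʻōlelo no ka make",
        "In language there is life, in language there is death",
        "",
        "language, power of words",
    ),
)

# MATCH runs a full-text query across all indexed columns,
# so a search on an English keyword from the meaning still
# returns the Hawaiian proverb text.
rows = conn.execute(
    "SELECT text FROM olelo WHERE olelo MATCH ?", ("language",)
).fetchall()
print(rows[0][0])
```

The same pattern scales to PostgreSQL with `tsvector` columns and a GIN index; the point is that the "searchable system" step is mostly a schema decision once the text is clean.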

Findings and Reflections

What did you learn?

Decoding OCR: A Comprehensive Guide
Explore comprehensive OCR technology: from key metrics and preprocessing techniques to advanced models like Surya-OCR.

Great breakdown of what's going on when you're using OCR. I didn't go into all the other technologies, but it's fascinating, and it gets me thinking about how our own brains break down meaning from handwritten language and, in some instances, attach memories to it.

  • ocr (optical character recognition) - example: tesseract ocr API (open-source, supports multiple languages)
  • icr (intelligent character recognition - handwriting recognition) - example: aws textract (supports handwriting recognition) and Omni AI.
  • lvlm / dnn / dtr (large vision-language models + deep neural networks + deep text recognition) - example: gpt-4v (vision-enabled text processing)
  • speech-to-text (for audiobooks & narration) - example: whisper (multilingual, high accuracy)
  • scanning hardware & mobile apps - example: adobe scan (mobile scanning with ocr)
  • edge ai & embedded ocr - example: opencv + tesseract (for real-time ocr on edge devices)
  • crowdsourced & hybrid approaches - example: recaptcha (google's crowdsourced digitization)
  • blockchain & decentralized digital archives - example: ipfs (decentralized file storage for digital books)

Combining approaches gets us to quicker self-correction, improved formatting, better language handling, and less post-processing by hand. I think Omni AI got me from point A to B the quickest, with the least friction.
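One concrete example of the post-processing that Hawaiian text needs: OCR engines often emit the ʻokina as a straight apostrophe (') or backtick (`) instead of the correct glottal-stop character (ʻ, U+02BB). A minimal normalization pass might look like the sketch below; the regex rules are illustrative, not a complete corrector, and real correction would still need the manual-review step above.

```python
import re

OKINA = "\u02bb"  # ʻ (modifier letter turned comma)

def normalize_okina(text: str) -> str:
    """Replace apostrophe-like characters that OCR emits in place
    of the ʻokina: between two letters (Hawai'i -> Hawaiʻi) and
    word-initially before a vowel ('olelo -> ʻolelo)."""
    # Between letters
    text = re.sub(r"(?<=\w)['`\u2019](?=\w)", OKINA, text)
    # Word-initial, immediately before a vowel
    text = re.sub(r"(?<!\w)['`\u2019](?=[aeiouAEIOU])", OKINA, text)
    return text

print(normalize_okina("I ka 'olelo no ke ola"))
```

A fuller pass would also restore kahakō (macrons on ā, ē, ī, ō, ū), which usually requires a wordlist or language model rather than character rules alone.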

formatted word document
formatted markdown document

Other Thoughts

  • Curious how libraries / archives go about doing this work now. It seems really disjointed—research says this, archive bought this outdated software, academia is ahead of museums etc.
  • In order to make this ironclad, I think making it an open-source project, like Wikipedia, would make sense. You could invite people with the expertise to contribute to a system everyone benefits from.
  • More Hawaiian 'ike needs to be digitized. It's like the printing press—we need to get this knowledge online, archived, and stored somewhere safe. If they burn down all the libraries / servers, we need to have backups.
  • Allegedly, US, UK, Germany, China, and France are leaders in putting resources towards digitization of IP and national archives.
