This one is a teaser. It needs to be built on a better server before it can be served to a large user base. It is a Tesseract OCR tool built on an open platform. It is basically the final nail in the coffin of the troubles around “Legacy/Unicode Devanagari Text”.
It scans your image or PDF with Tesseract, an open-source OCR engine, and uses Tesseract's language and training data to recognize the text in the image or PDF, or even handwritten text, and produce Unicode text. It doesn’t matter whether the source was typed in a legacy font like “Preeti” rather than Unicode, because OCR reads the rendered letter shapes, not the underlying encoding. This works equally well.
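To make the flow above concrete, here is a minimal sketch, not the project's actual code. It assumes the Tesseract script in question is Tesseract.js and that it has already been loaded on the page so that a global `Tesseract` object exists. The helper name `imageToUnicodeText` and the default language code `'nep'` (Nepali) are my own choices for illustration.

```javascript
// Hypothetical helper: run OCR on one image and return Unicode text.
// `engine` is assumed to be the Tesseract.js global (loaded via a <script> tag);
// it is passed in as a parameter so the function can also be tried with a stand-in.
async function imageToUnicodeText(image, lang = 'nep', engine = globalThis.Tesseract) {
  if (!engine || typeof engine.recognize !== 'function') {
    throw new Error('Tesseract.js does not appear to be loaded');
  }
  // Tesseract.js resolves with an object whose `data.text` holds the recognized text.
  const result = await engine.recognize(image, lang);
  return result.data.text.trim();
}
```

In a page this could be called as `imageToUnicodeText(fileInput.files[0])`, where `fileInput` is whatever file picker the page uses. Several languages can be combined with `+`, for example `'nep+eng'`.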
Here is the sample HTML and JavaScript for this project. For OCR on huge PDF files, a less memory-hungry method may need to be devised, or a better computer used. Until then, you can use the code below for your purposes. It pulls the Tesseract script and runs on the user's side, in the browser.
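On the memory point, one possible approach (a rough sketch under my own assumptions, not something this project already does) is to OCR a PDF one page at a time instead of rendering every page up front. The function `ocrPagesSequentially` and its `renderPage` and `recognize` parameters are hypothetical placeholders: `renderPage` would be whatever turns page `n` into an image, and `recognize` whatever turns an image into text.

```javascript
// Hypothetical helper: OCR a multi-page document page by page.
// `renderPage(n)` and `recognize(image)` are supplied by the caller;
// they are placeholders, not functions from any library.
async function ocrPagesSequentially(pageCount, renderPage, recognize, onPage = () => {}) {
  const texts = [];
  for (let page = 1; page <= pageCount; page++) {
    // Render only the current page, so this loop holds one page image at a time.
    let image = await renderPage(page);
    const text = await recognize(image);
    image = null; // drop the reference so the page image can be garbage-collected
    texts.push(text);
    onPage(page, pageCount); // optional progress callback
  }
  // Join page texts with a blank line between pages.
  return texts.join('\n\n');
}
```

This trades speed for memory: pages are processed strictly in order, so only the recognized text accumulates while each page image is released before the next one is rendered.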
Feel free to use this in any way you see fit, as I do not own this. : )
Courtesy of the open-source OCR engine Tesseract:
GitHub – tesseract-ocr/tesseract: Tesseract Open Source OCR Engine