As usual, the featured image above doesn’t have much to do with the content below.
This is my attempt at html-izing a Bantawa language dictionary. A good friend gifted me a Bantawa dictionary, knowing my love for this language. Bantawa, a Kirati language, belongs to the Tibeto-Burman (भोटबर्मेली परिवार) language family. You can explore more on this topic here: Oldest Friends from Flatlands, Hills, and Beyond.
Bantawa has primarily been a spoken language, and since no standardized form has been fully developed, multiple dialects like Dilpali, Amchoke, Hatuwali, and Dhankute exist. Among the Kirati / Rai languages, Bantawa is one of the most widespread, with a significant number of speakers. It is spoken across Nepal, including Bhojpur, Khotang, Dhankuta, Panchthar, Ilam, Morang, Sunsari, Udayapur, and even in Kathmandu Valley. Beyond Nepal, Bantawa is also spoken in parts of India (Darjeeling, Kalimpong, Sikkim), Bhutan, and Myanmar.
In this blog, the discussion centers on the latest “freeloader” approach to digitizing the dictionary: preparing the page images for the best possible OCR recognition, processing and structuring the extracted text for efficient retrieval, and wrapping it all up as an HTML set.
But first, type something in the box below and see the Bantawa – Nepali – English meaning.
Step 1: Getting the Data Corpus
This Bantawa Dictionary was a gift, and with it came the responsibility of taking on this project. It took me nearly three weeks to fully digitize it into HTML. Initially, I relied on Google Lens for text recognition, but its automation restrictions made web-scraping-style workflows impractical. That led me to explore alternatives, and after a frustrating search, I discovered that Bing Lens had improved significantly – sometimes even outperforming Google Lens. This was a relief, since commercial OCR APIs were out of my budget.
Step 2: Image Manipulation
The next step was preparing the images for OCR. The dictionary’s content was laid out in a two-column format, so there had to be a way to split the pages down the middle for better text extraction. This involved training an AI-based image processor using the first 50 pages, and surprisingly, it achieved 100% accuracy in splitting all 800 pages. The reason for this approach? Smaller, column-separated images improved OCR accuracy dramatically.
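The splitting itself does not always need a learned model: for evenly laid-out scans, a fixed midpoint crop does the same job. Below is a minimal sketch using Pillow – not the trained splitter described above – and the output file naming is illustrative:

```python
from PIL import Image
import os

def split_columns(image_path, out_dir):
    """Split a two-column dictionary page down the middle.

    A fixed midpoint is a naive assumption that works when the two
    columns are evenly laid out; uneven scans need a smarter split point.
    """
    img = Image.open(image_path)
    w, h = img.size
    mid = w // 2  # assume the columns meet at the horizontal center
    left = img.crop((0, 0, mid, h))
    right = img.crop((mid, 0, w, h))
    base = os.path.splitext(os.path.basename(image_path))[0]
    left_path = os.path.join(out_dir, f"{base}_left.png")
    right_path = os.path.join(out_dir, f"{base}_right.png")
    left.save(left_path)
    right.save(right_path)
    return left_path, right_path
```

Each half-page image can then be fed to OCR on its own, which is exactly why the split helps accuracy.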
Step 3: Running It Through an OCR Engine
Preparing the images was just the beginning – the real challenge was getting high-quality text extraction. This required a crude but effective approach:
Preprocessing the images – cropping, enhancing contrast, and ensuring clarity.
Automating OCR with Bing Lens – since there was no direct API, I used Selenium automation in Chrome Debugging Mode to feed images into Bing Lens and extract the text.
Here’s the Python script I used for this process:
'''
Start a Chrome instance in debug mode first:
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\chromedev"
'''
import os
import shutil
import fitz  # PyMuPDF
from ftplib import FTP
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import random
import string

# Folder locations
pdf_folder = r'remaining'
done_pdfs_folder = r'completed'
done_text_folder = r'textfiles'

# Replace with your FTP server details
ftp_host = 'enteryourhostname'
ftp_user = 'enteryourcpanelusername'
ftp_pass = 'enteryourcpanelpassword'
ftp_upload_dir = 'enteryouruploaddirectory'

# Function to upload an image to an FTP server and get the URL with a unique name
def upload_to_ftp(image_path, ftp_host, ftp_username, ftp_password, ftp_upload_dir):
    try:
        ftp = FTP(ftp_host)
        ftp.login(user=ftp_username, passwd=ftp_password)
        ftp.cwd(ftp_upload_dir)
        # Generate a unique prefix and rename the file before uploading
        unique_prefix = generate_unique_prefix()
        shortname = os.path.basename(image_path)
        unique_name = f"{unique_prefix}_{shortname}"
        unique_path = os.path.join(os.path.dirname(image_path), unique_name)
        # Rename the local file temporarily
        os.rename(image_path, unique_path)
        # Upload the renamed file
        with open(unique_path, 'rb') as file:
            ftp.storbinary(f'STOR {unique_name}', file)
        # Restore the original filename locally
        os.rename(unique_path, image_path)
        uploaded_url = f'https://{ftp_upload_dir}/{unique_name}'
        print(f'Uploaded Image URL: {uploaded_url}')
        ftp.quit()
        return uploaded_url
    except Exception as e:
        print(f"Error uploading image: {str(e)}")
        return None

# Function to calculate progress
def calculate_progress():
    remaining_pdf_count = len([file for file in os.listdir(pdf_folder) if file.endswith('.pdf')])
    done_pdf_count = len([file for file in os.listdir(done_pdfs_folder) if file.endswith('.pdf')])
    total_pdf_count = remaining_pdf_count + done_pdf_count
    if total_pdf_count == 0:
        return 0.0
    return done_pdf_count / total_pdf_count

def process_image(image_url, retries=100):
    options = Options()
    options.debugger_address = "127.0.0.1:9222"  # Attach to existing Chrome instance
    # Uncomment the next line to run in headless mode
    # options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    bing_url = (
        f"https://www.bing.com/images/search?view=detailv2&iss=sbi&form=SBIVSP&sbisrc=UrlPaste"
        f"&q=imgurl:{image_url}&selectedindex=0&id={image_url}&mediaurl={image_url}&exph=0&expw=0&vt=2&sim=10"
    )
    attempt = 0
    result_text = None
    while attempt < retries:
        driver.get(bing_url)
        print("attempt: " + str(attempt))
        try:
            # Wait for the textarea or check for "Can't find text"
            WebDriverWait(driver, 15).until(lambda d:
                d.find_elements(By.XPATH, "/html/body/div[2]/div/div/div[2]/div/div[2]/ul/li/div/div[2]/textarea") or
                "Can't find text" in d.page_source
            )
            # If "Can't find text" is present, retry immediately
            if "Can't find text" in driver.page_source:
                attempt += 1
                continue
            # Try to get the result text
            result_textarea = driver.find_element(By.XPATH, "/html/body/div[2]/div/div/div[2]/div/div[2]/ul/li/div/div[2]/textarea")
            result_text = result_textarea.get_attribute("value")
            if result_text.strip():
                break
        except (TimeoutException, NoSuchElementException):
            pass
        attempt += 1
    driver.quit()
    return result_text

# Function to generate a unique 5-character prefix
def generate_unique_prefix(length=5):
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))

# Function to remove an image from the local folder and from the FTP server
def remove_image(image_path, uploaded_url):
    try:
        # Remove local image
        os.remove(image_path)
        # Extract the unique filename from the uploaded URL
        unique_filename = os.path.basename(uploaded_url)
        with FTP(ftp_host) as ftp:
            ftp.login(user=ftp_user, passwd=ftp_pass)
            ftp.cwd(ftp_upload_dir)
            ftp.delete(unique_filename)
            print(f"Deleted {unique_filename} from FTP")
    except Exception as e:
        print(f"Error removing image: {str(e)}")

# Function to extract page images from a PDF
def extract_images_from_pdf(pdf_path, image_folder):
    pdf_document = fitz.open(pdf_path)
    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        image = page.get_pixmap(matrix=fitz.Matrix(125/72, 125/72))  # render at ~125 DPI
        image_path = os.path.join(image_folder, f'page{page_num + 1}.png')
        image.save(image_path)
    pdf_document.close()

# Main function
def process_pdfs(pdf_folder):
    for pdf_file in os.listdir(pdf_folder):
        if pdf_file.endswith('.pdf'):
            pdf_name = os.path.splitext(pdf_file)[0]
            pdf_path = os.path.join(pdf_folder, pdf_file)
            # Create folder for images
            image_folder = os.path.join(pdf_folder, pdf_name)
            os.makedirs(image_folder, exist_ok=True)
            # Create folder for text files
            text_folder = os.path.join(done_text_folder, pdf_name)
            os.makedirs(text_folder, exist_ok=True)
            # Extract images from the PDF
            extract_images_from_pdf(pdf_path, image_folder)
            # Process each image in page order
            image_files = [f for f in os.listdir(image_folder) if f.endswith('.png')]
            image_files.sort(key=lambda x: int(x.split("page")[1].split(".png")[0]))
            for image_file in image_files:
                image_path = os.path.join(image_folder, image_file)
                # Upload image to FTP and get URL
                image_url = upload_to_ftp(image_path, ftp_host, ftp_user, ftp_pass, ftp_upload_dir)
                if image_url:
                    print(image_url)
                    # Run OCR on the image
                    extracted_text = process_image(image_url)
                    if extracted_text:
                        remove_image(image_path, image_url)
                        # Create a separate text file for each page
                        page_number = image_file.split('page')[1].split('.png')[0]
                        txt_file_name = f'page{page_number}.txt'
                        txt_file_path = os.path.join(text_folder, txt_file_name)
                        # Save extracted text in a separate file
                        with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
                            txt_file.write(extracted_text)
                            txt_file.write('\n')
            # Move the processed PDF to the "completed" folder
            shutil.move(pdf_path, os.path.join(done_pdfs_folder, pdf_file))
            # Remove the temporary image folder if it's empty
            if not os.listdir(image_folder):
                os.rmdir(image_folder)
            # Calculate and print progress
            progress = calculate_progress()
            print(f'Progress: {progress * 100:.2f}%')

if __name__ == "__main__":
    process_pdfs(pdf_folder)
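The preprocessing mentioned above – cropping and enhancing contrast before upload – can be sketched with Pillow. The margin and contrast values here are illustrative assumptions to tune per scan, not the exact settings I used:

```python
from PIL import Image, ImageEnhance, ImageOps

def preprocess_for_ocr(image_path, out_path, contrast=1.8, margin=10):
    """Prepare a page image for OCR: grayscale, trim margins, boost contrast."""
    img = Image.open(image_path).convert("L")                  # grayscale
    w, h = img.size
    img = img.crop((margin, margin, w - margin, h - margin))   # trim scanner edges
    img = ImageEnhance.Contrast(img).enhance(contrast)         # separate ink from paper
    img = ImageOps.autocontrast(img)                           # stretch the histogram
    img.save(out_path)
    return out_path
```

Running each page through a pass like this before uploading tends to cut down on misread characters, especially with Devanagari diacritics.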
Step 4: Data Processing and Cleaning
Once I had the raw text, the real work began – cleaning and structuring the data. This involved:
Compiling everything into spreadsheets.
Using advanced Excel formulas and macros to remove inconsistencies and format the text properly.
It was a tedious process, but if you enjoy data wrangling, it’s oddly satisfying.
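For readers who prefer Python over Excel, a comparable first cleaning pass might look like the minimal sketch below. The rules shown – collapsing whitespace and dropping blank or page-number lines – are examples, not the actual macros I used:

```python
import re

def clean_ocr_lines(raw_text):
    """Normalize raw OCR output: collapse runs of whitespace,
    and drop empty lines or purely numeric lines (page numbers)."""
    cleaned = []
    for line in raw_text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line or line.isdigit():
            continue
        cleaned.append(line)
    return cleaned
```

Further rules (merging entries broken across lines, fixing systematic OCR confusions) stack on the same pattern: one small, testable transformation at a time.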
Step 5: Wrapping It in HTML & Moving On
With the data cleaned and structured, the final step was wrapping everything in HTML format and making it searchable and user-friendly.
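As a rough illustration of that wrapping step, structured entries can be rendered into HTML like this. The markup, class names, and field order are assumptions for the sketch, not the dictionary page’s actual template:

```python
import html

def entries_to_html(entries):
    """Render (bantawa, nepali, english) triples as simple searchable HTML.

    The data-word attribute gives client-side search something to match on.
    """
    rows = []
    for bantawa, nepali, english in entries:
        rows.append(
            f'<div class="entry" data-word="{html.escape(bantawa)}">'
            f'<b>{html.escape(bantawa)}</b>: {html.escape(nepali)} / {html.escape(english)}'
            f'</div>'
        )
    return "\n".join(rows)
```

A short JavaScript filter over the data-word attributes is then enough to power a lookup box like the one at the top of this post.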
And with that, the project was finally complete – time to move on to the next one !