Yes to गजल, गीत र कविता, but this post isn’t about ghazals, songs, or poems, beautiful as these forms of expression undoubtedly are. Earlier this year, when I tried to start writing poems (a failed attempt, admittedly), I noticed the lack of a specific useful tool. Most of the time I kept searching for words that rhyme with the ones I had in mind instead of tapping into the spontaneous creativity that poetry demands. Hunting for fitting rhyming words was tedious: the frustrating experience of a typical no-good creator!!
So I decided to shift the focus and discuss the technical side of rhyming words: Tokenization, String Operations, Regular Expressions, Morphological Segmentation, Transliteration, Suffix Similarity, Fuzzy String Matching, and other tools used more by NLP analysts than by poets. Our goal here is to build a tool that can find rhyming words for any given Nepali word, converting the creative process into a mechanical one in an effort to make even creative pursuits like writing a poem entirely unappealing!!
Okay, jokes aside: if you are only interested in using the tool, skip straight to the String Matching and Fuzzy Fits section or use this link: Rhyming Nepali Words
Obtaining the Shabdakosh dataset
First we need a corpus of data to draw our words from. For this we are using the dictionary released by Nepal Pragya Pratisthan: Nepali Brihat Shabdakosh, 10th Edition.
I used Power Query, optical character recognition, and automation scripts to convert this Shabdakosh (dictionary) into an Excel file, and a WordPress plugin to build a working dictionary app. If you want to try it out, the dictionary is here: Nepali Dictionary Application. The data cleaning involved in converting the PDF dictionary into the Excel file is not really relevant to this post, but if you are interested in the data corpus in Excel format, here is the link: Nepali Dictionary Excel Data Corpus. An added advantage of this corpus is that it already includes translated and transliterated versions of the word meanings, so you can search for relevant Nepali words in Nepali, English, or transliterated (Romanized) Nepali. The web application shared above also supports this search method.
Cleaning Up and Text Morphosis
This step focuses on cleaning the dataset referred to as the Nepali Dictionary Excel Data Corpus. The process involves the following tasks:
- Removing the Devanagari numerals from the Excel Data Corpus: In the Excel Data Corpus, some words that are written the same way but have different meanings are labeled with numerals like १, २, ३ to distinguish them. In this step, we will remove the adjoining Devanagari numerals, as they are not relevant to the phonetic matching process. The resulting column will be named “wordfit,” with the input column being “Word.”
- Using the ntr Python library for transliteration: Transliteration involves converting text from one script to another by swapping letters in a predictable manner. We will use the Nepali-to-Roman transliteration library for this purpose. You can find the library here: nepali-to-roman · PyPI. The resulting column will be named “transliterate,” with the input column being “wordfit.”
- The next step involves swapping phonetically similar letters. For example, ‘ी’ with ‘ि’, ‘ू’ with ‘ु’, ‘ई’ with ‘इ’, ‘ऊ’ with ‘उ’, ‘श’ with ‘स’, and so on. Although these letters are different, they sound similar, which will help us match similar-sounding words more reliably. The resulting column will be named “phoneticMorph,” with the input column being “wordfit.”
- We will again transliterate the Nepali words in the “phoneticMorph” column using the transliteration Python library. The resulting column will be called “morphTransliterate.” While this step may seem redundant, it aims to minimize the chances of missing any similar words that emerged after the phonetic transformations applied in the previous step.
- In this step, we will create a list of synonyms for each word in the Excel data corpus. We will use a JSON file of synonyms that has already been prepared and is available for free on GitHub. Although this step is not directly related to finding the phonetic match for each word, it will provide additional information and context for the users of the tool. You can find the synonym list here: Nepali_synset.
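As a rough illustration of the numeral-removal and letter-swapping tasks above, here is a minimal sketch in plain Python. The function names `make_wordfit` and `make_phonetic_morph` are hypothetical, and the swap table covers only the pairs named above; the Colab notebook may well use a longer list and a different implementation. The transliteration tasks are left out because they depend on the third-party library.

```python
import re

# Devanagari digits run from U+0966 (०) to U+096F (९).
DEVANAGARI_DIGITS = re.compile(r"[\u0966-\u096F]+")

# Illustrative subset of swaps (assumption: only the pairs listed in the text).
PHONETIC_SWAPS = str.maketrans({
    "\u0940": "\u093F",  # ी -> ि
    "\u0942": "\u0941",  # ू -> ु
    "\u0908": "\u0907",  # ई -> इ
    "\u090A": "\u0909",  # ऊ -> उ
    "\u0936": "\u0938",  # श -> स
})


def make_wordfit(word: str) -> str:
    """Drop the numerals that distinguish same-spelling entries ("Word" -> "wordfit")."""
    return DEVANAGARI_DIGITS.sub("", word).strip()


def make_phonetic_morph(wordfit: str) -> str:
    """Collapse similar-sounding letters into one form ("wordfit" -> "phoneticMorph")."""
    return wordfit.translate(PHONETIC_SWAPS)


if __name__ == "__main__":
    for entry in ["कर१", "कर२", "शीत"]:
        cleaned = make_wordfit(entry)
        print(entry, cleaned, make_phonetic_morph(cleaned))
```

Because each swap maps a single code point to a single code point, `str.translate` handles the whole table in one pass; multi-character replacements would need a different approach.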
String Matching and Fuzzy Fits
The steps outlined above involve preparing the Excel dataset for use in the module. These data preparation steps are necessary only once; after creating the final Excel file, they do not need to be repeated. We will use the resulting final Excel file from Cleaning Up and Text Morphosis as the input for this step. In this step, we will implement string matching and fuzzy fit techniques to find phonetically similar or rhyming words.
We will apply the string matching techniques to each of the columns: “wordfit,” “transliterate,” “phoneticMorph,” and “morphTransliterate,” to maximize the identification of similar-sounding words. The approach for finding phonetic matches involves comparing word endings to identify similarities, focusing on suffix similarity to determine the closest matches. This method groups words with similar suffixes and processes them to find the best rhyming matches.
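To make the suffix-similarity idea concrete, here is a small sketch using only the standard library. It is not the code from the Colab notebook: the function names, the `min_suffix` threshold, and the use of `difflib.SequenceMatcher` as the fuzzy tie-breaker are all assumptions chosen for illustration. It ranks candidates by how many trailing characters they share with the query word, then by overall string similarity.

```python
from difflib import SequenceMatcher


def common_suffix_len(a: str, b: str) -> int:
    """Count how many trailing characters two strings share."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def rhyme_candidates(query, words, min_suffix=2, top_n=10):
    """Rank words by shared-suffix length, then by fuzzy similarity to the query."""
    scored = []
    for w in words:
        if w == query:
            continue  # a word is not its own rhyme
        suffix = common_suffix_len(query, w)
        if suffix < min_suffix:
            continue  # endings too different to rhyme
        fuzzy = SequenceMatcher(None, query, w).ratio()
        scored.append((suffix, fuzzy, w))
    # Longest shared suffix first, then highest fuzzy ratio, then alphabetical.
    scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return [w for _, _, w in scored[:top_n]]


if __name__ == "__main__":
    romanized = ["chhaya", "daya", "kitab", "maya", "saya"]
    print(rhyme_candidates("maya", romanized))
```

The same function could be run once per column (“wordfit,” “transliterate,” “phoneticMorph,” and “morphTransliterate”) and the results merged, which is one way to read the "apply to each column" approach described above.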
The core concept is straightforward. The details are worked through in the Google Colaboratory notebook linked below.
Google Colaboratory Links
Colab for Cleaning Up and Text Morphosis
Link to the Google Colab here: Cleaning Up and Text Morphosis
Trivia
A few interesting extras have been implemented in the Colab tool above:
1. Hover over a result and a tooltip shows the meaning of that word from the Nepali Brihat Shabdakosh dictionary.
2. The same tooltip also shows synonyms for the word, if any are available.
3. Click on any word in the results and it takes you to the Google Translate page for that word.
its gold dada
this was the first time I actually enjoyed a clickbait haha
tqsm Abi – but there was no intention to clickbait : )