Need for finding semantic similarity in Excel
When we are cleaning up data in Excel, the built-in identifying and removing duplicates tool is great for spotting exact matches, but we need to do better than that. Because the real‑world text data is messy and finding exact match for duplicate identification is not enough because of the following reasons:
- Typos and variations: Two entries like “Shreedevi Kapoor” vs. “Shreedevi Kapur” won’t be caught as duplicates even though they likely refer to the same person.
- Synonyms and paraphrases: Phrases like “purchase order” vs. “order for purchase” or “NYC” vs. “New York City” looks different character‑for‑character, yet carries the same meaning.
- Transpositions and small edits: Swapped letters (“adress” vs. “address”) or small insertions/deletions can hide duplicates from simple exact‑match filters.
A semantic similarity finder lets us score how close two strings are on a scale from 0 (completely different) to 1 (identical). That score can then drive fuzzy‑matching pipelines – flagging likely duplicates even when the text doesn’t match exactly, reducing manual review and improving the quality of the data we are working with.
Approaches Used in the VBA Code
The custom SemanticSimilarity(text1, text2, method) function supports a wide range of algorithms. Below is a quick overview of each, with a small illustration or “visual” tip for how they work:
Method | What It Does | Example |
---|---|---|
Levenshtein | Counts the minimum single‐character edits (insertions, deletions, substitutions) to turn one string into another. | “kitten” → “sitting” requires 3 edits: k→s, e→i, +g |
Damerau‑Levenshtein | Like Levenshtein, but also allows swapping two adjacent characters as a single operation. | “ca” → “ac” is 1 transposition (swap), not 2 separate edits |
Optimal String Alignment (OSA) | A restricted version of Damerau‑Levenshtein where each substring can be edited at most once—prevents “double counting” overlapping edits. | “CA” → “ABC” still counts as 2 edits (insert A, substitute C→B), even though a naive DL might count more |
Hamming | Measures the percentage of positions that differ—but only valid when both strings are the same length. | “1011101” vs. “1001001” differ in 2 of 7 positions → similarity = 1 – 2/7 ≈ 0.714 |
Longest Common Subsequence (LCS) | Finds the longest sequence of matching characters (not necessarily contiguous), then divides its length by the longer string’s length. | “ABCDFG” & “AEDFHR” share subsequence “ADF” of length 3 → similarity = 3/6 = 0.5 |
Jaccard | Treats each string as a set of tokens; similarity = (size of the intersection) divided by (size of the union). | A = {“red”, “blue”}, B = {“blue”, “green”} → intersection has 1 token, union has 3 → score = 1 ÷ 3 ≈ 0.33 |
Sørensen‑Dice | Like Jaccard but gives twice as much weight to the shared tokens: similarity = 2 × (intersection size) ÷ (sum of the two set sizes). | Same A and B above → score = 2 × 1 ÷ (2+2) = 2 ÷ 4 = 0.5 |
Overlap Coefficient | Intersection size divided by the smaller set’s size—maximizes sensitivity when one set is much smaller. | A = {1,2,3}, B = {2,3,4,5} → int = {2,3}=2, smaller size = 3 → score = 2/3 ≈ 0.67 |
Jaro‑Winkler | Looks for matching characters within a “window” around each position and lightly penalizes transpositions; gives a bonus for common prefix. | “MARTHA” vs. “MARHTA”: most letters match within window, only one swap → score ≈ 0.961 |
Ratcliff‑Obershelp (Gestalt) | Recursively finds the longest common substring, then does the same on the text before and after that match, scoring by total matched length. | “ABCXYZABC” vs. “XYZABCXYZ” → longest common “XYZABC” length 6 → plus recursive checks on leftovers → overall similarity ≈ 0.67 |
N‑Gram (default n = 3) | Breaks each string into overlapping sequences of n characters and computes Jaccard over those n‑grams. | “data” → {“dat”,“ata”}; “date” → {“dat”,“ate”} → int={“dat”}=1, union=3 → score=1/3 ≈ 0.33 |
Q‑Gram (default q = 2) | Same as N‑Gram but with q=2, useful for shorter strings. | “hi” → {“hi”}, “hi!”→{“hi”,“i!”} → int={“hi”}=1, union=2 → score=1/2 = 0.5 |
Cosine | Converts each string into a word‑frequency vector and measures the cosine of the angle between them (dot product normalized). | “red blue red” vs. “blue green red” → vectors [red:2,blue:1] vs. [red:1,blue:1,green:1] → dot=2, |
Tversky | Generalizes Jaccard by weighting non‑shared tokens differently via α, β parameters. | A={a,b,c}, B={b,c,d,e}, with α = β = 0.5 → score = 2/(2 + 0.5×1 + 0.5×2) = 2/3.5 ≈ 0.57 |
Soundex | Encodes words by their English phonetic pattern; returns 1 if codes match exactly, else 0. | “Robert” and “Rupert” both → R‐163 → similarity = 1; “Ashcraft” vs. “Ashcroft” → both → A‐261 → similarity = 1 |
How to Call the Function: Argument Table
You can plug any of the above method names into the SemanticSimilarity UDF. Here’s a quick reference:
Method Name (string) |
Description / VBA Call Example |
“levenshtein” | =SemanticSimilarity(A1, B1, “Levenshtein”) |
“damerau-levenshtein” | =SemanticSimilarity(A1, B1, “Damerau-Levenshtein”) |
“osa” or “optimal string alignment” | =SemanticSimilarity(A1, B1, “OSA”) |
“hamming” | =SemanticSimilarity(A1, B1, “Hamming”) |
“longest common subsequence” or “lcs” | =SemanticSimilarity(A1, B1, “LCS”) |
“jaccard” | =SemanticSimilarity(A1, B1, “Jaccard”) |
“jaro-winkler” | =SemanticSimilarity(A1, B1, “Jaro-Winkler”) |
“ratcliff-obershelp” or “gestalt” | =SemanticSimilarity(A1, B1, “Ratcliff-Obershelp”) |
“n-gram” | =SemanticSimilarity(A1, B1, “N-Gram”) |
“q-gram” | =SemanticSimilarity(A1, B1, “Q-Gram”) |
“sörensen-dice” or “sorensen-dice” | =SemanticSimilarity(A1, B1, “Sorensen-Dice”) |
“overlap” | =SemanticSimilarity(A1, B1, “Overlap”) |
“cosine” | =SemanticSimilarity(A1, B1, “Cosine”) |
“tversky” | =SemanticSimilarity(A1, B1, “Tversky”) |
“soundex” | =SemanticSimilarity(A1, B1, “Soundex”) |
How to Install This VBA Add‑In
You can plug any of the above method names into the SemanticSimilarity UDF. Here’s a quick reference:
Step 1: Download the Excel Addin File from this link. Download
Step 2: Install the Addin File
(i) Open your “Microsoft Excel”
(ii) Go to “Files:
(ii) Go to “Options”
(iv) Go to “Add-ins” tab in the new window
(v) At the bottom at “Manage” Section choose “Excel Add-ins” and click “Go”
(vi) Click “Browse” and browse and select the file downloaded from Step 1
(vii) Click “OK” and return to Excel Spreadsheet
Examples / Screenshot
You can plug any of the above method names into the SemanticSimilarity UDF. Here’s a quick reference:


The preceding illustrations shows four specific methodologies drawn from our earlier discussion of fifteen semantic similarity techniques. Upon examining the outcomes, it becomes evident that these selected methods are adept at finding the subtle nuances of semantic proximity between near-similar words and semantically similar phrases. This ability to quantify such relationships is crucial for a range of analytical purposes, including the exploration of lexical semantics and the practical identification of redundant information that might require further analysis or removal to improve data quality and efficiency.
VBA Code
VBA code for the all the 15 semantic similarity finder is as follows for reference.
Option Explicit
' Main UDF: returns a similarity score [0..1] for the chosen method
Public Function SemanticSimilarity( _
ByVal text1 As String, _
ByVal text2 As String, _
ByVal method As String _
) As Double
Dim m As String
m = LCase(Trim$(method))
' Route to the appropriate algorithm
Select Case m
Case "levenshtein"
SemanticSimilarity = 1 - LevenshteinDistance(text1, text2) / _
Application.WorksheetFunction.Max(Len(text1), Len(text2))
Case "jaccard"
SemanticSimilarity = JaccardSimilarity(text1, text2)
Case "jaro-winkler"
SemanticSimilarity = JaroWinkler(text1, text2)
Case "damerau-levenshtein"
SemanticSimilarity = 1 - DamerauLevenshtein(text1, text2) / _
Application.WorksheetFunction.Max(Len(text1), Len(text2))
Case "cosine"
SemanticSimilarity = CosineSimilarity(text1, text2)
Case "n-gram"
SemanticSimilarity = NGramSimilarity(text1, text2, 3)
Case "sörensen-dice", "sorensen-dice"
SemanticSimilarity = SorensenDice(text1, text2)
Case "longest common subsequence", "lcs"
SemanticSimilarity = LCSSimilarity(text1, text2)
Case "hamming"
SemanticSimilarity = 1 - HammingDistance(text1, text2) / _
Application.WorksheetFunction.Max(Len(text1), Len(text2))
Case "q-gram"
SemanticSimilarity = QGramSimilarity(text1, text2, 2)
Case "overlap"
SemanticSimilarity = OverlapCoeff(text1, text2)
Case "ratcliff-obershelp", "gestalt"
SemanticSimilarity = RatcliffObershelp(text1, text2)
Case "osa", "optimal string alignment"
SemanticSimilarity = 1 - OptimalStringAlignment(text1, text2) / _
Application.WorksheetFunction.Max(Len(text1), Len(text2))
Case "soundex"
SemanticSimilarity = IIf(SoundexCode(text1) = SoundexCode(text2), 1, 0)
Case "tversky"
SemanticSimilarity = TverskyIndex(text1, text2, 0.5, 0.5)
Case Else
SemanticSimilarity = CVErr(xlErrNA)
End Select
End Function
'=========================
' Helper Implementations
'=========================
' 1. Levenshtein
Public Function LevenshteinDistance(s As String, t As String) As Long
Dim d() As Long
Dim i As Long, j As Long, cost As Long
Dim ls As Long, lt As Long
ls = Len(s): lt = Len(t)
ReDim d(0 To ls, 0 To lt)
For i = 0 To ls: d(i, 0) = i: Next i
For j = 0 To lt: d(0, j) = j: Next j
For i = 1 To ls
For j = 1 To lt
If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0 Else cost = 1
d(i, j) = Application.WorksheetFunction.Min( _
d(i - 1, j) + 1, _
d(i, j - 1) + 1, _
d(i - 1, j - 1) + cost)
Next j
Next i
LevenshteinDistance = d(ls, lt)
End Function
' 2. Jaccard
Public Function JaccardSimilarity(a As String, b As String) As Double
Dim dictA As Object, dictB As Object, w As Variant
Dim inter As Long, uni As Long
Set dictA = CreateObject("Scripting.Dictionary")
Set dictB = CreateObject("Scripting.Dictionary")
For Each w In Split(LCase(a))
If Len(w) > 0 Then dictA(w) = 1
Next
For Each w In Split(LCase(b))
If Len(w) > 0 Then dictB(w) = 1
Next
inter = 0
For Each w In dictA.Keys
If dictB.Exists(w) Then inter = inter + 1
Next
uni = dictA.Count + dictB.Count - inter
If uni = 0 Then JaccardSimilarity = 0 Else JaccardSimilarity = inter / uni
End Function
' 3. Jaro-Winkler
Public Function JaroWinkler(s1 As String, s2 As String) As Double
' Simplified implementation: adapt from Wikipedia pseudocode
Dim j As Double, jw As Double
Dim m As Long, t As Long
Dim prefix As Long, i As Long
m = 0: t = 0: prefix = 0
Dim matchDist As Long
matchDist = Int(Application.WorksheetFunction.Max(Len(s1), Len(s2)) / 2) - 1
Dim s1Flags() As Boolean, s2Flags() As Boolean
ReDim s1Flags(1 To Len(s1)), s2Flags(1 To Len(s2))
' Count matches
For i = 1 To Len(s1)
Dim startPos As Long, endPos As Long
startPos = Application.WorksheetFunction.Max(1, i - matchDist)
endPos = Application.WorksheetFunction.Min(Len(s2), i + matchDist)
Dim jdx As Long
For jdx = startPos To endPos
If Not s2Flags(jdx) And Mid$(s1, i, 1) = Mid$(s2, jdx, 1) Then
m = m + 1
s1Flags(i) = True
s2Flags(jdx) = True
Exit For
End If
Next jdx
Next
If m = 0 Then
JaroWinkler = 0
Exit Function
End If
' Count transpositions
Dim seq1 As String, seq2 As String
For i = 1 To Len(s1)
If s1Flags(i) Then seq1 = seq1 & Mid$(s1, i, 1)
Next
For j = 1 To Len(s2)
If s2Flags(j) Then seq2 = seq2 & Mid$(s2, j, 1)
Next
For i = 1 To Len(seq1)
If Mid$(seq1, i, 1) <> Mid$(seq2, i, 1) Then t = t + 1
Next
j = (m / Len(s1) + m / Len(s2) + (m - t / 2) / m) / 3
' Common prefix up to length 4
For i = 1 To Application.WorksheetFunction.Min(4, Len(s1), Len(s2))
If Mid$(s1, i, 1) = Mid$(s2, i, 1) Then prefix = prefix + 1 Else Exit For
Next
jw = j + 0.1 * prefix * (1 - j)
JaroWinkler = jw
End Function
' 4. Damerau-Levenshtein (full)
Public Function DamerauLevenshtein(s As String, t As String) As Long
Dim DA As Object
Set DA = CreateObject("Scripting.Dictionary")
Dim INF As Long
INF = Len(s) + Len(t)
Dim H() As Long
Dim i As Long, j As Long
ReDim H(0 To Len(s) + 2, 0 To Len(t) + 2)
H(0, 0) = INF
For i = 0 To Len(s): H(i + 1, 0) = INF: H(i + 1, 1) = i: Next
For j = 0 To Len(t): H(0, j + 1) = INF: H(1, j + 1) = j: Next
Dim db As Long, i1 As Long, j1 As Long
db = 0
For i = 1 To Len(s)
DA(Mid$(s, i, 1)) = 0
j1 = 0
For j = 1 To Len(t)
Dim i2 As Long: i2 = DA(Mid$(t, j, 1))
Dim cost As Long
If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0: db = j Else cost = 1
H(i + 1, j + 1) = Application.WorksheetFunction.Min( _
H(i, j) + cost, _
H(i + 1, j) + 1, _
H(i, j + 1) + 1, _
H(i2, j1) + (i - i2 - 1) + 1 + (j - j1 - 1))
j1 = j
Next
DA(Mid$(s, i, 1)) = i
Next
DamerauLevenshtein = H(Len(s) + 1, Len(t) + 1)
End Function
' 5. Cosine Similarity
Public Function CosineSimilarity(a As String, b As String) As Double
Dim dictA As Object, dictB As Object, w As Variant
Set dictA = CreateObject("Scripting.Dictionary")
Set dictB = CreateObject("Scripting.Dictionary")
For Each w In Split(LCase(a))
If Len(w) Then dictA(w) = dictA(w) + 1
Next
For Each w In Split(LCase(b))
If Len(w) Then dictB(w) = dictB(w) + 1
Next
Dim dot As Double, magA As Double, magB As Double
For Each w In dictA.Keys
dot = dot + dictA(w) * IIf(dictB.Exists(w), dictB(w), 0)
magA = magA + dictA(w) ^ 2
Next
For Each w In dictB.Keys
magB = magB + dictB(w) ^ 2
Next
If magA = 0 Or magB = 0 Then
CosineSimilarity = 0
Else
CosineSimilarity = dot / (Sqr(magA) * Sqr(magB))
End If
End Function
' 6. N-Gram Similarity (fixed)
Public Function NGramSimilarity(a As String, b As String, Optional q As Long = 3) As Double
Dim gramsA As Object, gramsB As Object
Dim i As Long, g As String
Dim inter As Long, sumA As Long, sumB As Long
Dim uni As Long
Dim key As Variant
' Initialize dictionaries
Set gramsA = CreateObject("Scripting.Dictionary")
Set gramsB = CreateObject("Scripting.Dictionary")
a = LCase$(Trim$(a))
b = LCase$(Trim$(b))
' If one string is shorter than q, no overlap possible
If Len(a) < q Or Len(b) < q Then
NGramSimilarity = 0
Exit Function
End If
' Build q-gram counts for a
For i = 1 To Len(a) - q + 1
g = Mid$(a, i, q)
If gramsA.Exists(g) Then
gramsA(g) = gramsA(g) + 1
Else
gramsA.Add g, 1
End If
Next i
' Build q-gram counts for b
For i = 1 To Len(b) - q + 1
g = Mid$(b, i, q)
If gramsB.Exists(g) Then
gramsB(g) = gramsB(g) + 1
Else
gramsB.Add g, 1
End If
Next i
' Compute intersection and sums
inter = 0: sumA = 0: sumB = 0
For Each key In gramsA.Keys
sumA = sumA + gramsA(key)
If gramsB.Exists(key) Then
inter = inter + Application.WorksheetFunction.Min(gramsA(key), gramsB(key))
End If
Next key
For Each key In gramsB.Keys
sumB = sumB + gramsB(key)
Next key
uni = sumA + sumB - inter
' Return similarity = intersection / union
If uni > 0 Then
NGramSimilarity = inter / uni
Else
NGramSimilarity = 0
End If
End Function
' 7. Sørensen–Dice
Public Function SorensenDice(a As String, b As String) As Double
Dim dictA As Object, dictB As Object, w As Variant, inter As Long
Set dictA = CreateObject("Scripting.Dictionary")
Set dictB = CreateObject("Scripting.Dictionary")
For Each w In Split(LCase(a))
If Len(w) Then dictA(w) = 1
Next
For Each w In Split(LCase(b))
If Len(w) Then dictB(w) = 1
Next
For Each w In dictA.Keys
If dictB.Exists(w) Then inter = inter + 1
Next
If dictA.Count + dictB.Count = 0 Then
SorensenDice = 0
Else
SorensenDice = 2 * inter / (dictA.Count + dictB.Count)
End If
End Function
' 8. Longest Common Subsequence (normalized)
Public Function LCSSimilarity(s As String, t As String) As Double
Dim dp() As Long
Dim i As Long, j As Long
Dim ls As Long, lt As Long
ls = Len(s): lt = Len(t)
ReDim dp(0 To ls, 0 To lt)
For i = 1 To ls
For j = 1 To lt
If Mid$(s, i, 1) = Mid$(t, j, 1) Then
dp(i, j) = dp(i - 1, j - 1) + 1
Else
dp(i, j) = Application.WorksheetFunction.Max(dp(i - 1, j), dp(i, j - 1))
End If
Next
Next
If Application.WorksheetFunction.Max(ls, lt) = 0 Then
LCSSimilarity = 0
Else
LCSSimilarity = dp(ls, lt) / Application.WorksheetFunction.Max(ls, lt)
End If
End Function
' 9. Hamming
Public Function HammingDistance(s As String, t As String) As Long
Dim i As Long
For i = 1 To Len(s)
If Mid$(s, i, 1) <> Mid$(t, i, 1) Then HammingDistance = HammingDistance + 1
Next
End Function
' 10. Q-Gram (normalized)
Public Function QGramSimilarity(a As String, b As String, Optional q As Long = 2) As Double
QGramSimilarity = NGramSimilarity(a, b, q)
End Function
' 11. Overlap Coefficient (Fixed)
Public Function OverlapCoeff(a As String, b As String) As Double
Dim dictA As Object, dictB As Object, w As Variant
Dim inter As Long
Set dictA = CreateObject("Scripting.Dictionary")
Set dictB = CreateObject("Scripting.Dictionary")
' Build token sets
For Each w In Split(LCase(a))
w = Trim$(w)
If Len(w) > 0 Then dictA(w) = 1
Next
For Each w In Split(LCase(b))
w = Trim$(w)
If Len(w) > 0 Then dictB(w) = 1
Next
' If either side is empty, similarity is zero
If dictA.Count = 0 Or dictB.Count = 0 Then
OverlapCoeff = 0
Exit Function
End If
' Count intersection
inter = 0
For Each w In dictA.Keys
If dictB.Exists(w) Then inter = inter + 1
Next
Dim minCount As Long
minCount = Application.WorksheetFunction.Min(dictA.Count, dictB.Count)
If minCount = 0 Then
OverlapCoeff = 0
Else
OverlapCoeff = CDbl(inter) / CDbl(minCount)
End If
End Function
' 12. Ratcliff–Obershelp (Gestalt) - Complete implementation
' Ratcliff–Obershelp (Gestalt) algorithm implementation
Public Function RatcliffObershelp(a As String, b As String) As Double
If Len(a) = 0 And Len(b) = 0 Then
RatcliffObershelp = 1 ' Two empty strings are identical
Exit Function
ElseIf Len(a) = 0 Or Len(b) = 0 Then
RatcliffObershelp = 0 ' One empty string means no similarity
Exit Function
End If
Dim matches As Long
matches = ROMatchCount(a, b)
RatcliffObershelp = (2 * matches) / (Len(a) + Len(b))
End Function
' Helper function for Ratcliff-Obershelp
Private Function ROMatchCount(a As String, b As String) As Long
If Len(a) = 0 Or Len(b) = 0 Then
ROMatchCount = 0
Exit Function
End If
' Find the longest common substring
Dim i As Long, j As Long, length As Long, maxLength As Long
Dim startA As Long, startB As Long
maxLength = 0
For i = 1 To Len(a)
For j = 1 To Len(b)
length = 0
' Find common substring starting at positions i and j
Do While (i + length <= Len(a)) And (j + length <= Len(b)) And _
(Mid$(a, i + length, 1) = Mid$(b, j + length, 1))
length = length + 1
Loop
If length > maxLength Then
maxLength = length
startA = i
startB = j
End If
Next j
Next i
If maxLength = 0 Then
ROMatchCount = 0
Exit Function
End If
' Recursively compute matches for prefixes and suffixes
Dim beforeA As String, afterA As String
Dim beforeB As String, afterB As String
beforeA = Left$(a, startA - 1)
afterA = Mid$(a, startA + maxLength)
beforeB = Left$(b, startB - 1)
afterB = Mid$(b, startB + maxLength)
ROMatchCount = maxLength + ROMatchCount(beforeA, beforeB) + ROMatchCount(afterA, afterB)
End Function
' 13. Optimal String Alignment
' (restricted Damerau-Levenshtein)
'===========================
Public Function OptimalStringAlignment(ByVal s As String, ByVal t As String) As Long
Dim ls As Long
Dim lt As Long
Dim d() As Long
Dim i As Long
Dim j As Long
Dim cost As Long
Dim transCost As Long
ls = Len(s)
lt = Len(t)
ReDim d(0 To ls, 0 To lt)
' Initialize base cases
For i = 0 To ls
d(i, 0) = i
Next i
For j = 0 To lt
d(0, j) = j
Next j
' Fill DP table
For i = 1 To ls
For j = 1 To lt
' substitution cost
If Mid$(s, i, 1) = Mid$(t, j, 1) Then
cost = 0
Else
cost = 1
End If
' deletion, insertion, substitution
d(i, j) = Min3(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
' restricted transposition:
' swap of adjacent chars costs 1
If i > 1 And j > 1 Then
If Mid$(s, i, 1) = Mid$(t, j - 1, 1) And Mid$(s, i - 1, 1) = Mid$(t, j, 1) Then
transCost = d(i - 2, j - 2) + 1
If transCost < d(i, j) Then
d(i, j) = transCost
End If
End If
End If
Next j
Next i
OptimalStringAlignment = d(ls, lt)
End Function
'===========================
' Helper: minimum of three Longs (Robust Version)
'===========================
Private Function Min3(ByVal a As Long, ByVal b As Long, ByVal c As Long) As Long
Dim minVal As Long
minVal = a
If b < minVal Then
minVal = b
End If
If c < minVal Then
minVal = c
End If
Min3 = minVal
End Function
' 14. Soundex – correct VBA implementation
Public Function SoundexCode(s As String) As String
Dim str As String, c As String
Dim result As String, lastDigit As String, digit As String
Dim i As Long
str = UCase$(Trim$(s))
If Len(str) = 0 Then
SoundexCode = ""
Exit Function
End If
' 1) Save first letter
result = Left$(str, 1)
lastDigit = GetSoundexDigit(result)
' 2) Process remaining letters
For i = 2 To Len(str)
c = Mid$(str, i, 1)
digit = GetSoundexDigit(c)
If digit <> "" Then
' only add if not the same as previous digit
If digit <> lastDigit Then
result = result & digit
lastDigit = digit
End If
Else
' vowels and H, W, Y reset lastDigit so next real digit isn't suppressed
lastDigit = ""
End If
If Len(result) = 4 Then Exit For
Next i
' 3) Pad with zeros or truncate to 4 chars
If Len(result) < 4 Then
result = result & String$(4 - Len(result), "0")
ElseIf Len(result) > 4 Then
result = Left$(result, 4)
End If
SoundexCode = result
End Function
' Helper: map a letter to its Soundex digit (or "" for vowels/H/W/Y)
Private Function GetSoundexDigit(c As String) As String
Select Case c
Case "B", "P", "F", "V": GetSoundexDigit = "1"
Case "C", "S", "K", "G", "J", "Q", "X", "Z": GetSoundexDigit = "2"
Case "D", "T": GetSoundexDigit = "3"
Case "L": GetSoundexDigit = "4"
Case "M", "N": GetSoundexDigit = "5"
Case "R": GetSoundexDigit = "6"
Case Else: GetSoundexDigit = ""
End Select
End Function
' 17. Tversky Index
Public Function TverskyIndex(a As String, b As String, _
Optional alpha As Double = 0.5, _
Optional beta As Double = 0.5 _
) As Double
Dim dictA As Object, dictB As Object, w As Variant
Dim inter As Long, onlyA As Long, onlyB As Long
Set dictA = CreateObject("Scripting.Dictionary")
Set dictB = CreateObject("Scripting.Dictionary")
For Each w In Split(LCase(a))
If Len(w) Then dictA(w) = 1
Next
For Each w In Split(LCase(b))
If Len(w) Then dictB(w) = 1
Next
For Each w In dictA.Keys
If dictB.Exists(w) Then
inter = inter + 1
Else
onlyA = onlyA + 1
End If
Next
For Each w In dictB.Keys
If Not dictA.Exists(w) Then onlyB = onlyB + 1
Next
If inter + alpha * onlyA + beta * onlyB = 0 Then
TverskyIndex = 0
Else
TverskyIndex = inter / (inter + alpha * onlyA + beta * onlyB)
End If
End Function
' Additional utility/helper functions
' Function to convert string to character array for easier processing
Private Function StringToArray(s As String) As Variant
Dim arr() As String
Dim i As Long
ReDim arr(1 To Len(s))
For i = 1 To Len(s)
arr(i) = Mid$(s, i, 1)
Next
StringToArray = arr
End Function
' Function to get n-grams from a string
Private Function GetNGrams(s As String, n As Long) As Object
Dim ngrams As Object
Set ngrams = CreateObject("Scripting.Dictionary")
Dim i As Long
For i = 1 To Len(s) - n + 1
Dim gram As String
gram = Mid$(s, i, n)
ngrams(gram) = ngrams(gram) + 1
Next
Set GetNGrams = ngrams
End Function
' Function to count common characters between two strings for Jaro similarity
Private Function CountCommonChars(s1 As String, s2 As String, maxDist As Long) As Long
Dim matches As Long
Dim s1Flags() As Boolean, s2Flags() As Boolean
ReDim s1Flags(1 To Len(s1))
ReDim s2Flags(1 To Len(s2))
' Count matching characters within maxDist
Dim i As Long, j As Long
For i = 1 To Len(s1)
Dim start As Long, finish As Long
start = Application.WorksheetFunction.Max(1, i - maxDist)
finish = Application.WorksheetFunction.Min(Len(s2), i + maxDist)
For j = start To finish
If Not s2Flags(j) And Mid$(s1, i, 1) = Mid$(s2, j, 1) Then
s1Flags(i) = True
s2Flags(j) = True
matches = matches + 1
Exit For
End If
Next j
Next i
CountCommonChars = matches
End Function
' Function to count transpositions for Jaro similarity
Private Function CountTranspositions(s1 As String, s2 As String, s1Flags() As Boolean, s2Flags() As Boolean) As Long
Dim transpositions As Long
Dim k1 As Long, k2 As Long
k1 = 1
k2 = 1
' Count half-transpositions
Dim i As Long
For i = 1 To Len(s1)
If s1Flags(i) Then
Do While Not s2Flags(k2)
k2 = k2 + 1
Loop
If Mid$(s1, i, 1) <> Mid$(s2, k2, 1) Then
transpositions = transpositions + 1
End If
k2 = k2 + 1
End If
Next i
' Integer division by 2 to get full transpositions
CountTranspositions = transpositions \ 2
End Function
' Function to tokenize text into words
Private Function TokenizeWords(s As String) As Object
Dim words As Object
Set words = CreateObject("Scripting.Dictionary")
Dim w As Variant
For Each w In Split(LCase(s))
If Len(w) Then words(w) = words(w) + 1
Next
Set TokenizeWords = words
End Function
' Function to get character sets
Private Function GetCharacterSet(s As String) As Object
Dim charSet As Object
Set charSet = CreateObject("Scripting.Dictionary")
Dim i As Long
For i = 1 To Len(s)
charSet(Mid$(s, i, 1)) = charSet(Mid$(s, i, 1)) + 1
Next
Set GetCharacterSet = charSet
End Function
I was searching alot for something like this? Just declaring this is very innovative. Thanks again.