when finding duplicates doesn’t quite cut it

Need for finding semantic similarity in Excel

When we are cleaning up data in Excel, the built-in identifying and removing duplicates tool is great for spotting exact matches, but we need to do better than that. Because the real‑world text data is messy and finding exact match for duplicate identification is not enough because of the following reasons:

Typos and variations: Two entries like “Shreedevi Kapoor” vs. “Shreedevi Kapur” won’t be caught as duplicates even though they likely refer to the same person.
Synonyms and paraphrases: Phrases like “purchase order” vs. “order for purchase” or “NYC” vs. “New York City” looks different character‑for‑character, yet carries the same meaning.
Transpositions and small edits: Swapped letters (“adress” vs. “address”) or small insertions/deletions can hide duplicates from simple exact‑match filters.
A semantic similarity finder lets us score how close two strings are on a scale from 0 (completely different) to 1 (identical). That score can then drive fuzzy‑matching pipelines – flagging likely duplicates even when the text doesn’t match exactly, reducing manual review and improving the quality of the data we are working with.

Approaches Used in the VBA Code

The custom SemanticSimilarity(text1, text2, method) function supports a wide range of algorithms. Below is a quick overview of each, with a small illustration or “visual” tip for how they work:

Method	What It Does	Example
Levenshtein	Counts the minimum single‐character edits (insertions, deletions, substitutions) to turn one string into another.	“kitten” → “sitting” requires 3 edits: k→s, e→i, +g
Damerau‑Levenshtein	Like Levenshtein, but also allows swapping two adjacent characters as a single operation.	“ca” → “ac” is 1 transposition (swap), not 2 separate edits
Optimal String Alignment (OSA)	A restricted version of Damerau‑Levenshtein where each substring can be edited at most once—prevents “double counting” overlapping edits.	“CA” → “ABC” still counts as 2 edits (insert A, substitute C→B), even though a naive DL might count more
Hamming	Measures the percentage of positions that differ—but only valid when both strings are the same length.	“1011101” vs. “1001001” differ in 2 of 7 positions → similarity = 1 – 2/7 ≈ 0.714
Longest Common Subsequence (LCS)	Finds the longest sequence of matching characters (not necessarily contiguous), then divides its length by the longer string’s length.	“ABCDFG” & “AEDFHR” share subsequence “ADF” of length 3 → similarity = 3/6 = 0.5
Jaccard	Treats each string as a set of tokens; similarity = (size of the intersection) divided by (size of the union).	A = {“red”, “blue”}, B = {“blue”, “green”} → intersection has 1 token, union has 3 → score = 1 ÷ 3 ≈ 0.33
Sørensen‑Dice	Like Jaccard but gives twice as much weight to the shared tokens: similarity = 2 × (intersection size) ÷ (sum of the two set sizes).	Same A and B above → score = 2 × 1 ÷ (2+2) = 2 ÷ 4 = 0.5
Overlap Coefficient	Intersection size divided by the smaller set’s size—maximizes sensitivity when one set is much smaller.	A = {1,2,3}, B = {2,3,4,5} → int = {2,3}=2, smaller size = 3 → score = 2/3 ≈ 0.67
Jaro‑Winkler	Looks for matching characters within a “window” around each position and lightly penalizes transpositions; gives a bonus for common prefix.	“MARTHA” vs. “MARHTA”: most letters match within window, only one swap → score ≈ 0.961
Ratcliff‑Obershelp (Gestalt)	Recursively finds the longest common substring, then does the same on the text before and after that match, scoring by total matched length.	“ABCXYZABC” vs. “XYZABCXYZ” → longest common “XYZABC” length 6 → plus recursive checks on leftovers → overall similarity ≈ 0.67
N‑Gram (default n = 3)	Breaks each string into overlapping sequences of n characters and computes Jaccard over those n‑grams.	“data” → {“dat”,“ata”}; “date” → {“dat”,“ate”} → int={“dat”}=1, union=3 → score=1/3 ≈ 0.33
Q‑Gram (default q = 2)	Same as N‑Gram but with q=2, useful for shorter strings.	“hi” → {“hi”}, “hi!”→{“hi”,“i!”} → int={“hi”}=1, union=2 → score=1/2 = 0.5
Cosine	Converts each string into a word‑frequency vector and measures the cosine of the angle between them (dot product normalized).	“red blue red” vs. “blue green red” → vectors [red:2,blue:1] vs. [red:1,blue:1,green:1] → dot=2,
Tversky	Generalizes Jaccard by weighting non‑shared tokens differently via α, β parameters.	A={a,b,c}, B={b,c,d,e}, with α = β = 0.5 → score = 2/(2 + 0.5×1 + 0.5×2) = 2/3.5 ≈ 0.57
Soundex	Encodes words by their English phonetic pattern; returns 1 if codes match exactly, else 0.	“Robert” and “Rupert” both → R‐163 → similarity = 1; “Ashcraft” vs. “Ashcroft” → both → A‐261 → similarity = 1

How to Call the Function: Argument Table

You can plug any of the above method names into the SemanticSimilarity UDF. Here’s a quick reference:

Method Name (string)	Description / VBA Call Example
“levenshtein”	=SemanticSimilarity(A1, B1, “Levenshtein”)
“damerau-levenshtein”	=SemanticSimilarity(A1, B1, “Damerau-Levenshtein”)
“osa” or “optimal string alignment”	=SemanticSimilarity(A1, B1, “OSA”)
“hamming”	=SemanticSimilarity(A1, B1, “Hamming”)
“longest common subsequence” or “lcs”	=SemanticSimilarity(A1, B1, “LCS”)
“jaccard”	=SemanticSimilarity(A1, B1, “Jaccard”)
“jaro-winkler”	=SemanticSimilarity(A1, B1, “Jaro-Winkler”)
“ratcliff-obershelp” or “gestalt”	=SemanticSimilarity(A1, B1, “Ratcliff-Obershelp”)
“n-gram”	=SemanticSimilarity(A1, B1, “N-Gram”)
“q-gram”	=SemanticSimilarity(A1, B1, “Q-Gram”)
“sörensen-dice” or “sorensen-dice”	=SemanticSimilarity(A1, B1, “Sorensen-Dice”)
“overlap”	=SemanticSimilarity(A1, B1, “Overlap”)
“cosine”	=SemanticSimilarity(A1, B1, “Cosine”)
“tversky”	=SemanticSimilarity(A1, B1, “Tversky”)
“soundex”	=SemanticSimilarity(A1, B1, “Soundex”)

How to Install This VBA Add‑In

You can plug any of the above method names into the SemanticSimilarity UDF. Here’s a quick reference:

Step 1: Download the Excel Addin File from this link. Download

Step 2: Install the Addin File
(i) Open your “Microsoft Excel”
(ii) Go to “Files:
(ii) Go to “Options”
(iv) Go to “Add-ins” tab in the new window
(v) At the bottom at “Manage” Section choose “Excel Add-ins” and click “Go”
(vi) Click “Browse” and browse and select the file downloaded from Step 1
(vii) Click “OK” and return to Excel Spreadsheet

Examples / Screenshot

You can plug any of the above method names into the SemanticSimilarity UDF. Here’s a quick reference:

The preceding illustrations shows four specific methodologies drawn from our earlier discussion of fifteen semantic similarity techniques. Upon examining the outcomes, it becomes evident that these selected methods are adept at finding the subtle nuances of semantic proximity between near-similar words and semantically similar phrases. This ability to quantify such relationships is crucial for a range of analytical purposes, including the exploration of lexical semantics and the practical identification of redundant information that might require further analysis or removal to improve data quality and efficiency.

VBA Code

VBA code for the all the 15 semantic similarity finder is as follows for reference.

            Option Explicit

' Main UDF: returns a similarity score [0..1] for the chosen method
Public Function SemanticSimilarity( _
    ByVal text1 As String, _
    ByVal text2 As String, _
    ByVal method As String _
) As Double
    Dim m As String
    m = LCase(Trim$(method))
    
    ' Route to the appropriate algorithm
    Select Case m
        Case "levenshtein"
            SemanticSimilarity = 1 - LevenshteinDistance(text1, text2) / _
                Application.WorksheetFunction.Max(Len(text1), Len(text2))
        
        Case "jaccard"
            SemanticSimilarity = JaccardSimilarity(text1, text2)
        
        Case "jaro-winkler"
            SemanticSimilarity = JaroWinkler(text1, text2)
        
        Case "damerau-levenshtein"
            SemanticSimilarity = 1 - DamerauLevenshtein(text1, text2) / _
                Application.WorksheetFunction.Max(Len(text1), Len(text2))
        
        Case "cosine"
            SemanticSimilarity = CosineSimilarity(text1, text2)
        
        Case "n-gram"
            SemanticSimilarity = NGramSimilarity(text1, text2, 3)
        
        Case "sörensen-dice", "sorensen-dice"
            SemanticSimilarity = SorensenDice(text1, text2)
        
        Case "longest common subsequence", "lcs"
            SemanticSimilarity = LCSSimilarity(text1, text2)
        
        Case "hamming"
            SemanticSimilarity = 1 - HammingDistance(text1, text2) / _
                Application.WorksheetFunction.Max(Len(text1), Len(text2))
        
        Case "q-gram"
            SemanticSimilarity = QGramSimilarity(text1, text2, 2)
        
        Case "overlap"
            SemanticSimilarity = OverlapCoeff(text1, text2)
        
        Case "ratcliff-obershelp", "gestalt"
            SemanticSimilarity = RatcliffObershelp(text1, text2)
        
        Case "osa", "optimal string alignment"
            SemanticSimilarity = 1 - OptimalStringAlignment(text1, text2) / _
                Application.WorksheetFunction.Max(Len(text1), Len(text2))

        Case "soundex"
            SemanticSimilarity = IIf(SoundexCode(text1) = SoundexCode(text2), 1, 0)
                        
        Case "tversky"
            SemanticSimilarity = TverskyIndex(text1, text2, 0.5, 0.5)
        
        Case Else
            SemanticSimilarity = CVErr(xlErrNA)
    End Select
End Function

'=========================
' Helper Implementations
'=========================
' 1. Levenshtein
Public Function LevenshteinDistance(s As String, t As String) As Long
    Dim d() As Long
    Dim i As Long, j As Long, cost As Long
    Dim ls As Long, lt As Long
    
    ls = Len(s): lt = Len(t)
    ReDim d(0 To ls, 0 To lt)
    
    For i = 0 To ls: d(i, 0) = i: Next i
    For j = 0 To lt: d(0, j) = j: Next j
    
    For i = 1 To ls
        For j = 1 To lt
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0 Else cost = 1
            d(i, j) = Application.WorksheetFunction.Min( _
                d(i - 1, j) + 1, _
                d(i, j - 1) + 1, _
                d(i - 1, j - 1) + cost)
        Next j
    Next i
    
    LevenshteinDistance = d(ls, lt)
End Function

' 2. Jaccard
Public Function JaccardSimilarity(a As String, b As String) As Double
    Dim dictA As Object, dictB As Object, w As Variant
    Dim inter As Long, uni As Long
    Set dictA = CreateObject("Scripting.Dictionary")
    Set dictB = CreateObject("Scripting.Dictionary")
    
    For Each w In Split(LCase(a))
        If Len(w) > 0 Then dictA(w) = 1
    Next
    For Each w In Split(LCase(b))
        If Len(w) > 0 Then dictB(w) = 1
    Next
    
    inter = 0
    For Each w In dictA.Keys
        If dictB.Exists(w) Then inter = inter + 1
    Next
    uni = dictA.Count + dictB.Count - inter
    
    If uni = 0 Then JaccardSimilarity = 0 Else JaccardSimilarity = inter / uni
End Function

' 3. Jaro-Winkler
Public Function JaroWinkler(s1 As String, s2 As String) As Double
    ' Simplified implementation: adapt from Wikipedia pseudocode
    Dim j As Double, jw As Double
    Dim m As Long, t As Long
    Dim prefix As Long, i As Long
    
    m = 0: t = 0: prefix = 0
    Dim matchDist As Long
    matchDist = Int(Application.WorksheetFunction.Max(Len(s1), Len(s2)) / 2) - 1
    
    Dim s1Flags() As Boolean, s2Flags() As Boolean
    ReDim s1Flags(1 To Len(s1)), s2Flags(1 To Len(s2))
    
    ' Count matches
    For i = 1 To Len(s1)
        Dim startPos As Long, endPos As Long
        startPos = Application.WorksheetFunction.Max(1, i - matchDist)
        endPos = Application.WorksheetFunction.Min(Len(s2), i + matchDist)
        Dim jdx As Long
        For jdx = startPos To endPos
            If Not s2Flags(jdx) And Mid$(s1, i, 1) = Mid$(s2, jdx, 1) Then
                m = m + 1
                s1Flags(i) = True
                s2Flags(jdx) = True
                Exit For
            End If
        Next jdx
    Next
    
    If m = 0 Then
        JaroWinkler = 0
        Exit Function
    End If
    
    ' Count transpositions
    Dim seq1 As String, seq2 As String
    For i = 1 To Len(s1)
        If s1Flags(i) Then seq1 = seq1 & Mid$(s1, i, 1)
    Next
    For j = 1 To Len(s2)
        If s2Flags(j) Then seq2 = seq2 & Mid$(s2, j, 1)
    Next
    For i = 1 To Len(seq1)
        If Mid$(seq1, i, 1) <> Mid$(seq2, i, 1) Then t = t + 1
    Next
    
    j = (m / Len(s1) + m / Len(s2) + (m - t / 2) / m) / 3
    
    ' Common prefix up to length 4
    For i = 1 To Application.WorksheetFunction.Min(4, Len(s1), Len(s2))
        If Mid$(s1, i, 1) = Mid$(s2, i, 1) Then prefix = prefix + 1 Else Exit For
    Next
    
    jw = j + 0.1 * prefix * (1 - j)
    JaroWinkler = jw
End Function

' 4. Damerau-Levenshtein (full)
Public Function DamerauLevenshtein(s As String, t As String) As Long
    Dim DA As Object
    Set DA = CreateObject("Scripting.Dictionary")
    Dim INF As Long
    INF = Len(s) + Len(t)
    
    Dim H() As Long
    Dim i As Long, j As Long
    ReDim H(0 To Len(s) + 2, 0 To Len(t) + 2)
    H(0, 0) = INF
    For i = 0 To Len(s): H(i + 1, 0) = INF: H(i + 1, 1) = i: Next
    For j = 0 To Len(t): H(0, j + 1) = INF: H(1, j + 1) = j: Next
    
    Dim db As Long, i1 As Long, j1 As Long
    db = 0
    For i = 1 To Len(s)
        DA(Mid$(s, i, 1)) = 0
        j1 = 0
        For j = 1 To Len(t)
            Dim i2 As Long: i2 = DA(Mid$(t, j, 1))
            Dim cost As Long
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0: db = j Else cost = 1
            H(i + 1, j + 1) = Application.WorksheetFunction.Min( _
                H(i, j) + cost, _
                H(i + 1, j) + 1, _
                H(i, j + 1) + 1, _
                H(i2, j1) + (i - i2 - 1) + 1 + (j - j1 - 1))
            j1 = j
        Next
        DA(Mid$(s, i, 1)) = i
    Next
    
    DamerauLevenshtein = H(Len(s) + 1, Len(t) + 1)
End Function

' 5. Cosine Similarity
Public Function CosineSimilarity(a As String, b As String) As Double
    Dim dictA As Object, dictB As Object, w As Variant
    Set dictA = CreateObject("Scripting.Dictionary")
    Set dictB = CreateObject("Scripting.Dictionary")
    
    For Each w In Split(LCase(a))
        If Len(w) Then dictA(w) = dictA(w) + 1
    Next
    For Each w In Split(LCase(b))
        If Len(w) Then dictB(w) = dictB(w) + 1
    Next
    
    Dim dot As Double, magA As Double, magB As Double
    For Each w In dictA.Keys
        dot = dot + dictA(w) * IIf(dictB.Exists(w), dictB(w), 0)
        magA = magA + dictA(w) ^ 2
    Next
    For Each w In dictB.Keys
        magB = magB + dictB(w) ^ 2
    Next
    
    If magA = 0 Or magB = 0 Then
        CosineSimilarity = 0
    Else
        CosineSimilarity = dot / (Sqr(magA) * Sqr(magB))
    End If
End Function

' 6. N-Gram Similarity (fixed)
Public Function NGramSimilarity(a As String, b As String, Optional q As Long = 3) As Double
    Dim gramsA As Object, gramsB As Object
    Dim i As Long, g As String
    Dim inter As Long, sumA As Long, sumB As Long
    Dim uni As Long
    Dim key As Variant
    
    ' Initialize dictionaries
    Set gramsA = CreateObject("Scripting.Dictionary")
    Set gramsB = CreateObject("Scripting.Dictionary")
    
    a = LCase$(Trim$(a))
    b = LCase$(Trim$(b))
    
    ' If one string is shorter than q, no overlap possible
    If Len(a) < q Or Len(b) < q Then
        NGramSimilarity = 0
        Exit Function
    End If
    
    ' Build q-gram counts for a
    For i = 1 To Len(a) - q + 1
        g = Mid$(a, i, q)
        If gramsA.Exists(g) Then
            gramsA(g) = gramsA(g) + 1
        Else
            gramsA.Add g, 1
        End If
    Next i
    
    ' Build q-gram counts for b
    For i = 1 To Len(b) - q + 1
        g = Mid$(b, i, q)
        If gramsB.Exists(g) Then
            gramsB(g) = gramsB(g) + 1
        Else
            gramsB.Add g, 1
        End If
    Next i
    
    ' Compute intersection and sums
    inter = 0: sumA = 0: sumB = 0
    For Each key In gramsA.Keys
        sumA = sumA + gramsA(key)
        If gramsB.Exists(key) Then
            inter = inter + Application.WorksheetFunction.Min(gramsA(key), gramsB(key))
        End If
    Next key
    
    For Each key In gramsB.Keys
        sumB = sumB + gramsB(key)
    Next key
    
    uni = sumA + sumB - inter
    
    ' Return similarity = intersection / union
    If uni > 0 Then
        NGramSimilarity = inter / uni
    Else
        NGramSimilarity = 0
    End If
End Function


' 7. Sørensen–Dice
Public Function SorensenDice(a As String, b As String) As Double
    Dim dictA As Object, dictB As Object, w As Variant, inter As Long
    Set dictA = CreateObject("Scripting.Dictionary")
    Set dictB = CreateObject("Scripting.Dictionary")
    
    For Each w In Split(LCase(a))
        If Len(w) Then dictA(w) = 1
    Next
    For Each w In Split(LCase(b))
        If Len(w) Then dictB(w) = 1
    Next
    
    For Each w In dictA.Keys
        If dictB.Exists(w) Then inter = inter + 1
    Next
    
    If dictA.Count + dictB.Count = 0 Then
        SorensenDice = 0
    Else
        SorensenDice = 2 * inter / (dictA.Count + dictB.Count)
    End If
End Function

' 8. Longest Common Subsequence (normalized)
Public Function LCSSimilarity(s As String, t As String) As Double
    Dim dp() As Long
    Dim i As Long, j As Long
    Dim ls As Long, lt As Long
    
    ls = Len(s): lt = Len(t)
    ReDim dp(0 To ls, 0 To lt)
    
    For i = 1 To ls
        For j = 1 To lt
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then
                dp(i, j) = dp(i - 1, j - 1) + 1
            Else
                dp(i, j) = Application.WorksheetFunction.Max(dp(i - 1, j), dp(i, j - 1))
            End If
        Next
    Next
    
    If Application.WorksheetFunction.Max(ls, lt) = 0 Then
        LCSSimilarity = 0
    Else
        LCSSimilarity = dp(ls, lt) / Application.WorksheetFunction.Max(ls, lt)
    End If
End Function

' 9. Hamming
Public Function HammingDistance(s As String, t As String) As Long
    Dim i As Long
    For i = 1 To Len(s)
        If Mid$(s, i, 1) <> Mid$(t, i, 1) Then HammingDistance = HammingDistance + 1
    Next
End Function

' 10. Q-Gram (normalized)
Public Function QGramSimilarity(a As String, b As String, Optional q As Long = 2) As Double
    QGramSimilarity = NGramSimilarity(a, b, q)
End Function

' 11. Overlap Coefficient (Fixed)
Public Function OverlapCoeff(a As String, b As String) As Double
    Dim dictA As Object, dictB As Object, w As Variant
    Dim inter As Long

    Set dictA = CreateObject("Scripting.Dictionary")
    Set dictB = CreateObject("Scripting.Dictionary")

    ' Build token sets
    For Each w In Split(LCase(a))
        w = Trim$(w)
        If Len(w) > 0 Then dictA(w) = 1
    Next
    For Each w In Split(LCase(b))
        w = Trim$(w)
        If Len(w) > 0 Then dictB(w) = 1
    Next

    ' If either side is empty, similarity is zero
    If dictA.Count = 0 Or dictB.Count = 0 Then
        OverlapCoeff = 0
        Exit Function
    End If

    ' Count intersection
    inter = 0
    For Each w In dictA.Keys
        If dictB.Exists(w) Then inter = inter + 1
    Next
    
    Dim minCount As Long
    minCount = Application.WorksheetFunction.Min(dictA.Count, dictB.Count)
    If minCount = 0 Then
        OverlapCoeff = 0
    Else
        OverlapCoeff = CDbl(inter) / CDbl(minCount)
    End If
End Function



' 12. Ratcliff–Obershelp (Gestalt) - Complete implementation

' Ratcliff–Obershelp (Gestalt) algorithm implementation
Public Function RatcliffObershelp(a As String, b As String) As Double
    If Len(a) = 0 And Len(b) = 0 Then
        RatcliffObershelp = 1  ' Two empty strings are identical
        Exit Function
    ElseIf Len(a) = 0 Or Len(b) = 0 Then
        RatcliffObershelp = 0  ' One empty string means no similarity
        Exit Function
    End If
    
    Dim matches As Long
    matches = ROMatchCount(a, b)
    
    RatcliffObershelp = (2 * matches) / (Len(a) + Len(b))
End Function

' Helper function for Ratcliff-Obershelp
Private Function ROMatchCount(a As String, b As String) As Long
    If Len(a) = 0 Or Len(b) = 0 Then
        ROMatchCount = 0
        Exit Function
    End If
    
    ' Find the longest common substring
    Dim i As Long, j As Long, length As Long, maxLength As Long
    Dim startA As Long, startB As Long
    maxLength = 0
    
    For i = 1 To Len(a)
        For j = 1 To Len(b)
            length = 0
            ' Find common substring starting at positions i and j
            Do While (i + length <= Len(a)) And (j + length <= Len(b)) And _
                     (Mid$(a, i + length, 1) = Mid$(b, j + length, 1))
                length = length + 1
            Loop
            
            If length > maxLength Then
                maxLength = length
                startA = i
                startB = j
            End If
        Next j
    Next i
    
    If maxLength = 0 Then
        ROMatchCount = 0
        Exit Function
    End If
    
    ' Recursively compute matches for prefixes and suffixes
    Dim beforeA As String, afterA As String
    Dim beforeB As String, afterB As String
    beforeA = Left$(a, startA - 1)
    afterA = Mid$(a, startA + maxLength)
    beforeB = Left$(b, startB - 1)
    afterB = Mid$(b, startB + maxLength)
    
    ROMatchCount = maxLength + ROMatchCount(beforeA, beforeB) + ROMatchCount(afterA, afterB)
End Function


' 13. Optimal String Alignment
' (restricted Damerau-Levenshtein)
'===========================
Public Function OptimalStringAlignment(ByVal s As String, ByVal t As String) As Long
    Dim ls As Long
    Dim lt As Long
    Dim d() As Long
    Dim i As Long
    Dim j As Long
    Dim cost As Long
    Dim transCost As Long

    ls = Len(s)
    lt = Len(t)

    ReDim d(0 To ls, 0 To lt)

    ' Initialize base cases
    For i = 0 To ls
        d(i, 0) = i
    Next i
    For j = 0 To lt
        d(0, j) = j
    Next j

    ' Fill DP table
    For i = 1 To ls
        For j = 1 To lt
            ' substitution cost
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then
                cost = 0
            Else
                cost = 1
            End If

            ' deletion, insertion, substitution
            d(i, j) = Min3(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)

            ' restricted transposition:
            '    swap of adjacent chars costs 1
            If i > 1 And j > 1 Then
                If Mid$(s, i, 1) = Mid$(t, j - 1, 1) And Mid$(s, i - 1, 1) = Mid$(t, j, 1) Then
                    transCost = d(i - 2, j - 2) + 1
                    If transCost < d(i, j) Then
                        d(i, j) = transCost
                    End If
                End If
            End If
        Next j
    Next i

    OptimalStringAlignment = d(ls, lt)
End Function

'===========================
' Helper: minimum of three Longs (Robust Version)
'===========================
Private Function Min3(ByVal a As Long, ByVal b As Long, ByVal c As Long) As Long
    Dim minVal As Long
    minVal = a
    If b < minVal Then
        minVal = b
    End If
    If c < minVal Then
        minVal = c
    End If
    Min3 = minVal
End Function

' 14. Soundex – correct VBA implementation
Public Function SoundexCode(s As String) As String
    Dim str As String, c As String
    Dim result As String, lastDigit As String, digit As String
    Dim i As Long

    str = UCase$(Trim$(s))
    If Len(str) = 0 Then
        SoundexCode = ""
        Exit Function
    End If

    ' 1) Save first letter
    result = Left$(str, 1)
    lastDigit = GetSoundexDigit(result)

    ' 2) Process remaining letters
    For i = 2 To Len(str)
        c = Mid$(str, i, 1)
        digit = GetSoundexDigit(c)
        If digit <> "" Then
            ' only add if not the same as previous digit
            If digit <> lastDigit Then
                result = result & digit
                lastDigit = digit
            End If
        Else
            ' vowels and H, W, Y reset lastDigit so next real digit isn't suppressed
            lastDigit = ""
        End If
        If Len(result) = 4 Then Exit For
    Next i

    ' 3) Pad with zeros or truncate to 4 chars
    If Len(result) < 4 Then
        result = result & String$(4 - Len(result), "0")
    ElseIf Len(result) > 4 Then
        result = Left$(result, 4)
    End If

    SoundexCode = result
End Function

' Helper: map a letter to its Soundex digit (or "" for vowels/H/W/Y)
Private Function GetSoundexDigit(c As String) As String
    Select Case c
        Case "B", "P", "F", "V": GetSoundexDigit = "1"
        Case "C", "S", "K", "G", "J", "Q", "X", "Z": GetSoundexDigit = "2"
        Case "D", "T": GetSoundexDigit = "3"
        Case "L": GetSoundexDigit = "4"
        Case "M", "N": GetSoundexDigit = "5"
        Case "R": GetSoundexDigit = "6"
        Case Else: GetSoundexDigit = ""
    End Select
End Function

' 17. Tversky Index
Public Function TverskyIndex(a As String, b As String, _
    Optional alpha As Double = 0.5, _
    Optional beta As Double = 0.5 _
) As Double
    Dim dictA As Object, dictB As Object, w As Variant
    Dim inter As Long, onlyA As Long, onlyB As Long
    Set dictA = CreateObject("Scripting.Dictionary")
    Set dictB = CreateObject("Scripting.Dictionary")
    
    For Each w In Split(LCase(a))
        If Len(w) Then dictA(w) = 1
    Next
    For Each w In Split(LCase(b))
        If Len(w) Then dictB(w) = 1
    Next
    
    For Each w In dictA.Keys
        If dictB.Exists(w) Then
            inter = inter + 1
        Else
            onlyA = onlyA + 1
        End If
    Next
    For Each w In dictB.Keys
        If Not dictA.Exists(w) Then onlyB = onlyB + 1
    Next
    
    If inter + alpha * onlyA + beta * onlyB = 0 Then
        TverskyIndex = 0
    Else
        TverskyIndex = inter / (inter + alpha * onlyA + beta * onlyB)
    End If
End Function

' Additional utility/helper functions

' Function to convert string to character array for easier processing
Private Function StringToArray(s As String) As Variant
    Dim arr() As String
    Dim i As Long
    ReDim arr(1 To Len(s))
    
    For i = 1 To Len(s)
        arr(i) = Mid$(s, i, 1)
    Next
    
    StringToArray = arr
End Function

' Function to get n-grams from a string
Private Function GetNGrams(s As String, n As Long) As Object
    Dim ngrams As Object
    Set ngrams = CreateObject("Scripting.Dictionary")
    
    Dim i As Long
    For i = 1 To Len(s) - n + 1
        Dim gram As String
        gram = Mid$(s, i, n)
        ngrams(gram) = ngrams(gram) + 1
    Next
    
    Set GetNGrams = ngrams
End Function

' Function to count common characters between two strings for Jaro similarity
Private Function CountCommonChars(s1 As String, s2 As String, maxDist As Long) As Long
    Dim matches As Long
    Dim s1Flags() As Boolean, s2Flags() As Boolean
    
    ReDim s1Flags(1 To Len(s1))
    ReDim s2Flags(1 To Len(s2))
    
    ' Count matching characters within maxDist
    Dim i As Long, j As Long
    For i = 1 To Len(s1)
        Dim start As Long, finish As Long
        start = Application.WorksheetFunction.Max(1, i - maxDist)
        finish = Application.WorksheetFunction.Min(Len(s2), i + maxDist)
        
        For j = start To finish
            If Not s2Flags(j) And Mid$(s1, i, 1) = Mid$(s2, j, 1) Then
                s1Flags(i) = True
                s2Flags(j) = True
                matches = matches + 1
                Exit For
            End If
        Next j
    Next i
    
    CountCommonChars = matches
End Function

' Function to count transpositions for Jaro similarity
Private Function CountTranspositions(s1 As String, s2 As String, s1Flags() As Boolean, s2Flags() As Boolean) As Long
    Dim transpositions As Long
    Dim k1 As Long, k2 As Long
    
    k1 = 1
    k2 = 1
    
    ' Count half-transpositions
    Dim i As Long
    For i = 1 To Len(s1)
        If s1Flags(i) Then
            Do While Not s2Flags(k2)
                k2 = k2 + 1
            Loop
            
            If Mid$(s1, i, 1) <> Mid$(s2, k2, 1) Then
                transpositions = transpositions + 1
            End If
            
            k2 = k2 + 1
        End If
    Next i
    
    ' Integer division by 2 to get full transpositions
    CountTranspositions = transpositions \ 2
End Function

' Function to tokenize text into words
Private Function TokenizeWords(s As String) As Object
    Dim words As Object
    Set words = CreateObject("Scripting.Dictionary")
    
    Dim w As Variant
    For Each w In Split(LCase(s))
        If Len(w) Then words(w) = words(w) + 1
    Next
    
    Set TokenizeWords = words
End Function

' Function to get character sets
Private Function GetCharacterSet(s As String) As Object
    Dim charSet As Object
    Set charSet = CreateObject("Scripting.Dictionary")
    
    Dim i As Long
    For i = 1 To Len(s)
        charSet(Mid$(s, i, 1)) = charSet(Mid$(s, i, 1)) + 1
    Next
    
    Set GetCharacterSet = charSet
End Function

Archives

Categories

Meta

when finding duplicates doesn’t quite cut it

Need for finding semantic similarity in Excel

Approaches Used in the VBA Code

How to Call the Function: Argument Table

How to Install This VBA Add‑In

Examples / Screenshot

VBA Code