More on transliteration

Author
Published

June 22, 2025

In the last post on transliteration, I introduced the idea of transliteration as implemented in programming, and pointed out that the process of transforming text is more general. In that regard, the implementation that works for one use case will work for another. Now, the question is what is the most efficient and appropriate implementation?

I talked about a prefix tree as easy for storing the sub-/strings to match on. However, in my pure Clojure implementation of a prefix tree, which is implemented using nested maps, the performance is slow. Very slow! But that’s not a reflection of Clojure, which is a language that is very practical and optimizes what it can. And the ethos of Clojure programming follows the maxim in programming, stemming from early Unix, of “make it work, make it right, make it fast”. As such, we should think about how to make this fast.

Tim asked me why this text transformation couldn’t have been implemented in a regex, and doing so would certainly make it fast. For example, to transliterate Tamil language text in Latin script into the Tamil script, my existing implementation would look like:

(def s "vaNakkam. padippavarkaLukku n-anRi.")
(def expected "வணக்கம். படிப்பவர்களுக்கு நன்றி.")
(cvt/romanized->தமிழ் s)
"வணக்கம். படிப்பவர்களுக்கு நன்றி."
(assert (= expected (cvt/romanized->தமிழ் s)))
nil

That transliteration is converting Latin script into Tamil script in a somewhat predictable and intuitive way, such that: a -> அ, aa -> ஆ, …, k -> க், ng -> ங், etc. Tim’s point is that you can detect the input substrings using the regex, and then feed the matching substring occurrences into a replacement map to get the translation. His previous pseudocode in JS looked like this:

let text = "this is a test";
const replacementMap = { 'th': 'X', 't': 'Y' };

let result = text.replace(/th|t/g, (match) => {
                                               return replacementMap[match];
                                               });

console.log(result);

He is taking into account the caveat that some of the substrings will overlap or be a superstring of other substrings, and therefore, order matters so that the right “rule” (match + replace) is triggered.

This should work. Let’s try it. In the “romanized->தமிழ்” function, where the word “romanized” really should be “Latin” for the name of the script, the conversions are more or less defined here: Let’s just reuse it!

cvt/romanized-தமிழ்-phoneme-map
{"d" "ட்",
 "n" "ன்",
 "ee" "ஏ",
 "z" "ழ்",
 "w" "ந்",
 "aa" "ஆ",
 "ii" "ஈ",
 "s" "ச்",
 "uu" "ஊ",
 "e" "எ",
 "zh" "ழ்",
 "ch" "ச்",
 "q" "ஃ",
 "L" "ள்",
 "p" "ப்",
 "oo" "ஓ",
 "nth" "ந்த்",
 "v" "வ்",
 "mb" "ம்ப்",
 "E" "ஏ",
 "R" "ற்",
 "a" "அ",
 "t" "ட்",
 "ai" "ஐ",
 "U" "ஊ",
 "ng" "ங்",
 "O" "ஓ",
 "i" "இ",
 "k" "க்",
 "b" "ப்",
 "r" "ர்",
 "y" "ய்",
 "g" "க்",
 "l" "ல்",
 "N" "ண்",
 "u" "உ",
 "A" "ஆ",
 "I" "ஈ",
 "m" "ம்",
 "th" "த்",
 "nj" "ஞ்",
 "o" "ஒ",
 "au" "ஔ",
 "nr" "ன்ற்",
 "nd" "ண்ட்",
 "n-" "ந்"}

Now to handle the caveat. As you can see, "t" is a substring of "th", and both are keys in the map. We effectively have to do a topological sort or some other graph traversal based on which keys are substrings of which other ones. In this particular case, a shortcut that is a huge hack (because it cannot possibly be generalizable) would be to sort the match strings in order of longest to shortest en route to constructing our regex string:

(->> (keys cvt/romanized-தமிழ்-phoneme-map)
     (sort-by count)
     reverse)
("nth"
 "n-"
 "nd"
 "nr"
 "au"
 "nj"
 "th"
 "ng"
 "ai"
 "mb"
 "oo"
 "ch"
 "zh"
 "uu"
 "ii"
 "aa"
 "ee"
 "o"
 "m"
 "I"
 "A"
 "u"
 "N"
 "l"
 "g"
 "y"
 "r"
 "b"
 "k"
 "i"
 "O"
 "U"
 "t"
 "a"
 "R"
 "E"
 "v"
 "p"
 "L"
 "q"
 "e"
 "s"
 "w"
 "z"
 "n"
 "d")

Our regex string will end up looking like:

(->> (keys cvt/romanized-தமிழ்-phoneme-map)
     (sort-by count)
     reverse
     (interpose \|)
     (apply str))
"nth|n-|nd|nr|au|nj|th|ng|ai|mb|oo|ch|zh|uu|ii|aa|ee|o|m|I|A|u|N|l|g|y|r|b|k|i|O|U|t|a|R|E|v|p|L|q|e|s|w|z|n|d"

Our regex would be formed by feeding it to re-pattern:

(def regex (re-pattern (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                            (sort-by count)
                            reverse
                            (interpose \|)
                            (apply str))))

We can do segmentation on the input string based on the transliteration/transformation substring match keys:

(re-seq regex s)
("v"
 "a"
 "N"
 "a"
 "k"
 "k"
 "a"
 "m"
 "p"
 "a"
 "d"
 "i"
 "p"
 "p"
 "a"
 "v"
 "a"
 "r"
 "k"
 "a"
 "L"
 "u"
 "k"
 "k"
 "u"
 "n-"
 "a"
 "n"
 "R"
 "i")

We can’t naively just transform the strings that match, however. Ex: you would lose the whitespace and punctuation in this example.

(->> (re-seq regex s)
     (map cvt/romanized-தமிழ்-phoneme-map)
     fmt/phonemes->str)
"வணக்கம்படிப்பவர்களுக்குநன்றி"

So we need to adjust our regex to be smart enough to have a “default branch” that matches the next character if nothing else matches. We do this by appending the match all shortcut . to the end of the giant pattern alternation:

(def regex (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                 (sort-by count)
                                 reverse
                                 (interpose \|)
                                 (apply str))
                            "|.")))

Now, we get non-matching characters in the output

(->> (re-seq regex s)
     (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
     fmt/phonemes->str)
"வணக்கம். படிப்பவர்களுக்கு நன்றி."

And for that matter, since the . regex alternation pattern matches a single character anyways, and you’re always doing a lookup on what is returned by the regex, we can remove any 1-character length strings from the regex pattern without change in functionality:

(def regex (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                 (sort-by count)
                                 reverse
                                 (remove #(= 1 (count %)))
                                 (interpose \|)
                                 (apply str))
                            "|.")))

Check that the output is the same:

(->> (re-seq regex s)
     (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
     fmt/phonemes->str)
"வணக்கம். படிப்பவர்களுக்கு நன்றி."

Let’s see that the new regex is faster than the slightly older regex, and that they are indeed faster than the unoptimized pure Clojure prefix tree implementation.

(def regex1 (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                 (sort-by count)
                                 reverse
                                 (interpose \|)
                                 (apply str))
                            "|.")))
(def regex2 (re-pattern (str (->> (keys cvt/romanized-தமிழ்-phoneme-map)
                                 (sort-by count)
                                 reverse
                                 (remove #(= 1 (count %)))
                                 (interpose \|)
                                 (apply str))
                            "|.")))
(def NUM-REPS 100)
(time (dotimes [_ NUM-REPS]
        (cvt/romanized->தமிழ் s)))
OUT
"Elapsed time: 257.629737 msecs"
nil
(time (dotimes [_ NUM-REPS]
        (->> (re-seq regex1 s)
             (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
             fmt/phonemes->str)))
OUT
"Elapsed time: 118.929934 msecs"
nil
(time (dotimes [_ NUM-REPS]
        (->> (re-seq regex2 s)
             (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
             fmt/phonemes->str)))
OUT
"Elapsed time: 90.454022 msecs"
nil

Well, this is surprising. I assumed that the regex implementation would be significantly faster. Let’s try to investigate.

Maybe the difference is less than we thought because fmt/phonemes->str is suspiciously inefficient (and also based on the prefix tree code). So what if we strike that out from the above expressions that were timed?

(time (dotimes [_ NUM-REPS]
        (->> (re-seq regex2 s)
             (map #(or (cvt/romanized-தமிழ்-phoneme-map %) %))
             str/join)))
OUT
"Elapsed time: 2.067643 msecs"
nil

So fmt/phonemes->str is the culprit. And the implementation of it uses prefix tree code, which is ripe for optimization, perhaps similar to what we just proved here?

source: src/internationalization/transliteration2.clj