Thursday, 21 July 2011

Some string matching

This puzzle from Facebook's engineering puzzles page is a good introduction to string matching.
A solution is possible using an algorithm called Levenshtein distance.

Let's take the example on the puzzle page -
TIHS SENTENTCNES ISS NOUT VARRRY GOUD

It's a munged up version of
THIS SENTENCE IS NOT VERY GOOD
The original problem statement involves minimizing the score (number of changes necessary) to transform each word in the munged up sentence into words that are in a predefined wordlist. It says nothing about grammatical correctness of the result.

So going by that, the score and the output sentence from my implementation are respectively, 8 and
TICS SENTENCES IDS NOT VAGARY GAUD

From the standpoint of the original problem, this solution is correct.
But not from a grammatical point of view. After some poking around, I realized it's not possible for this program to spew out the correct version of the sentence without it being aware of English grammar rules.


P.S. Without getting into grammar correction, of which I have no idea how to implement, I made a small change in the program as an experiment. The second version favours a smaller word whenever the scores for two are equal.
And the result is (Score remains the same)
TIS SENTENCES IS NOT VARY GOD