How to find anagrams

Here’s a very quick method for finding anagrams that I came up with this afternoon. It’s fairly straightforward, but also very efficient and, I think, interesting, so I’m sharing it here.

First, take a word list (mine was in /usr/share/dict/words, but you could use any list you like).

The next thing to do is to calculate a 26-dimensional vector for each word. That’s a lot less complicated than it sounds. The n ^th value is given by the frequency of letter n in the word. For example, if we take the word aardvark, then we have 3 As, 1 D, 1 K, 2 Rs and 1 V. The vector is thus:

[3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0]

This vector will be the same for any word (or garbage string) that contains the same letters in the same quantities; i.e. for any anagram.

Now, all we have to do is to index our words according to this vector, and we can quickly look up all anagrams of a given jumble of letters.

What I did was to concatenate all 26 values into a single text string, using “a” to indicate zero, “b” for one, and so on. Thus, the encoded vector for aardvark is as follows:

daabaaaaaabaaaaaacaaabaaaa

It’s not necessarily the best method, but it has the advantage of being plain text and easy to deal with. I simply inserted every word along with its corresponding key into a database table, indexing on the encoded vector.

It took a few minutes to generate the key for each of 200,000 words and to insert them into the database. However, once that is achieved, retrieval is almost instantaneous. Create an encoded vector for the source letters, and find all corresponding entries in the database.