Interview Question: Grouping Words Into Anagrams

This was a question I was recently asked on a technical phone screen:

Given a list of words, group them into anagrams. For example given ["tsar", "rat", "tar", "star", "tars", "cheese"] output [["tsar", "star", "tars"],["rat", "tar"],["cheese"]], although the order in the output does not matter.

Algorithmically, the problem boils down to finding a function pre_hash from the set of words to some hashable set (e.g. integers or words) such that pre_hash(x)=pre_hash(y) if and only if x is an anagram of y. The name of the function is inspired by the fact that its result is passed to a hash function for use in a hash table. Once we have the pre_hash it is a matter of keeping one list (or set in our implementation, since Python makes it easy to use sets) per possible pre-hash value. I came up with two possible pre-hash functions. They are discussed below, and an implementation of both is given below.

The first pre-hash function that I could think of was to simply sort the letters in the word. It is easy to notice that two anagrams would sort to the same word. For example, calling this pre-hash function sort for now, we have sort("star")=sort("tars")="arst". This function costs O(mlog(m))O(m \log(m)) where mm is the length of the word being pre-hashed. If the list contains only short words, say m<8m<8 , on average (which would be the case if the words are randomly selected English words, for example) then the overhead is likely negligible. This is implemented in the code below as the sort_prehash function.

The second choice for pre_hash would take linear time, O(m)O(m) , and is achieved by counting how many times each character occurs in the word. It is implemented below as count_letters_prehash, which returns a list of pairs of the form (letter, occurrence), for example ("a",2) if "a" occurs twice in the word. This choice will be more efficient if the words are longer. It is implemented using Python's collections.Counter. Note that lists are mutable and hence not hashable in Python, so we must convert the list of tuples into a tuple of tuples before returning the result.

Here's the solution, using both of the above pre-hash functions to solve the same problem.

import collections


def sort_prehash(word):
    return ''.join(sorted(word))


def count_letters_prehash(word):
    return tuple(collections.Counter(word).items())


def group_anagrams(words, hash_function):
    result = {}
    for w in words:
        s = hash_function(w.lower())
        if s in result:
            result[s] |= {w}
        else:
            result[s] = {w}
    return result.values()

# Usage:
>>> group_anagrams(['tsar', 'rat', 'tar', 'star', 'tars', 'cheese'], sort_prehash)
[set(['tars', 'tsar', 'star']), set(['cheese']), set(['rat', 'tar'])]
>>> group_anagrams(['tsar', 'rat', 'tar', 'star', 'tars', 'cheese'], count_letters_prehash)
[set(['tars', 'tsar', 'star']), set(['cheese']), set(['rat', 'tar'])]

Comments