Fun with Trees

Paul Battley

(fun may be subjective)

The problem: fuzzy string matching

Specifically, the best match from a set

Soundex, Metaphone, DoubleMetaphone

f(s) = f(t)

require 'text/metaphone'

Text::Metaphone.metaphone('Battley') # => "BTL"
Text::Metaphone.metaphone('Battery') # => "BTR"

Easy to index & look up

but …

Not generally applicable

Levenshtein (edit) distance

d(s,t)

require 'text/levenshtein'

Text::Levenshtein.distance('Battley', 'Battery') # => 2

More useful

but …

it’s slow!

it’s slow!

d(s,t1), d(s,t2), ..., d(s,tn)

Damn Cool Algorithms, Part 1: BK-Trees

http://tinyurl.com/2lnsx7

Burkhard & Keller, 1973

Some approaches to best match file searching

http://tinyurl.com/28b6eg

Burkhard & Keller, 1973

Some approaches to best match file searching

http://tinyurl.com/28b6eg

Me? Pay?

Baeza-Yates & Navarro, 1998

Fast Approximate String Matching in a Dictionary

http://tinyurl.com/yrkuqh

Baeza-Yates & Navarro, 1998

Fast Approximate String Matching in a Dictionary

http://tinyurl.com/yrkuqh

That’s more like it!

The Triangle Inequality:

In a metric space

The Triangle Inequality:

In a metric space

i.e. one where distance is defined

The Triangle Inequality:

In a metric space

i.e. one where distance is defined

d(x,z) ≤ d(x,y) + d(y,z)

Satisfied by the Levenshtein distance

We can use a BK tree

How to build a BK tree

How to build a BK tree

  1. Select an arbitrary element as the root node

How to build a BK tree

  1. Select an arbitrary element as the root node
  2. For each subsequent element, find the distance between it and the root

How to build a BK tree

  1. Select an arbitrary element as the root node
  2. For each subsequent element, find the distance between it and the root
  3. If there’s already a child node with that distance, recurse down it

Got that?

Pictures!

d(‘birded’, ‘infighting’) = 9

d(‘inebriation’, ‘infighting’) = 7

d(‘stargazers’, ‘infighting’) = 9
d(‘stargazers’, ‘birded’) = 8

... and so on.

Querying

Querying

  1. Choose a search term s and a maximum distance a

Querying

  1. Choose a search term s and a maximum distance a
  2. Starting at the root, calculate the distance b between the search term and the node term

Querying

  1. Choose a search term s and a maximum distance a
  2. Starting at the root, calculate the distance b between the search term and the node term
  3. Recurse over child nodes in the range
    (b-a) ≤ x ≤ (b+a)

E.g. within distance 1 of ‘game’

d(‘game’, ‘infighting’) = 9

d(‘game’, ‘infighting’) = 9

d(‘game’, ‘birded’) = 5
d(‘game’, ‘skydove’) = 6

d(‘game’, ‘birded’) = 5
d(‘game’, ‘skydove’) = 6

7 out of 14 elements

Implementation

module BK
  class Node
    attr_reader :term, :children
    
    def initialize(term, distancer)
      @term = term
      @children = {}
      @distancer = distancer
    end
    
    def distance(term)
      @distancer.distance(term, self.term)
    end  
  end
end
def add(term)
  score = distance(term)
  if child = children[score]
    child.add(term)
  else
    children[score] = Node.new(term, @distancer)
  end
end
def query(term, threshold, collected)
  d = distance(term)
  collected << self.term if d <= threshold
  ((threshold-d)..(threshold+d)).each do |score|
    child = children[score]
    child.query(term, threshold, collected) if child
  end
end
module BK
  class Tree
    def initialize(distancer = LevenshteinDistancer.new)
      @root = nil
      @distancer = distancer
    end
  
    def add(term)
      if @root
        @root.add(term)
      else
        @root = Node.new(term, @distancer)
      end
    end
  
    def query(term, threshold)
      collected = []
      @root.query(term, threshold, collected)
      return collected
    end
  end
end

Real-world performance

Looking for words within distance 1 of ‘alien’ in a 20,000-word dictionary

Loading 20000 words from dictionary ... 0.273s
Building tree ... 57.331s
Linear scan to find expected terms ... 5.711s
Query tree ... 0.133s
2.1% of tree was queried

Not too bad!

Not too bad!

About 40x as fast

Not too bad!

About 40x as fast

... but it took 10x as long to build the tree as to perform a linear search

With threshold of 2

Query tree ... 1.461s
31.1% of tree was queried

With threshold of 3

Query tree ... 3.368s
62.9% of tree was queried

Limitations

Memory usage

Memory usage

About 6 MB for a 20,000-word tree

Maximum tree depth limited by stack

Improvements

Improvements

  • Wrap simple objects

Improvements

  • Wrap simple objects
  • Store computed tree

Improvements

  • Wrap simple objects
  • Store computed tree
  • Delete nodes