Fun with Trees
Paul Battley
Paul Battley
(fun may be subjective)

The problem: fuzzy string matching
Specifically, the best match from a set
Soundex, Metaphone, DoubleMetaphone
f(s) = f(t)
require 'text/metaphone' Text::Metaphone.metaphone('Battley') # => "BTL" Text::Metaphone.metaphone('Battery') # => "BTR"
Easy to index & look up
but …
Not generally applicable
Levenshtein (edit) distance
d(s,t)
require 'text/levenshtein' Text::Levenshtein.distance('Battley', 'Battery') # => 2
More useful
but …
it’s slow!
it’s slow!
d(s,t1), d(s,t2), ..., d(s,tn)
Damn Cool Algorithms, Part 1: BK-Trees
Burkhard & Keller, 1973
Some approaches to best match file searching
Me? Pay?
Baeza-Yates & Navarro, 1998
Fast Approximate String Matching in a Dictionary
Baeza-Yates & Navarro, 1998
Fast Approximate String Matching in a Dictionary
That’s more like it!
The Triangle Inequality:
In a metric space
The Triangle Inequality:
In a metric space
i.e. one where distance is defined
The Triangle Inequality:
In a metric space
i.e. one where distance is defined
d(x,z) ≤ d(x,y) + d(y,z)
Satisfied by the Levenshtein distance
We can use a BK tree
How to build a BK tree
How to build a BK tree
How to build a BK tree
How to build a BK tree
Got that?
Pictures!
d(‘birded’, ‘infighting’) = 9

d(‘inebriation’, ‘infighting’) = 7

d(‘stargazers’, ‘infighting’) = 9
d(‘stargazers’, ‘birded’) = 8








... and so on.
Querying
Querying
Querying
Querying
E.g. within distance 1 of ‘game’

d(‘game’, ‘infighting’) = 9

d(‘game’, ‘infighting’) = 9

d(‘game’, ‘birded’) = 5
d(‘game’, ‘skydove’) = 6

d(‘game’, ‘birded’) = 5
d(‘game’, ‘skydove’) = 6

7 out of 14 elements
Implementation
module BK class Node attr_reader :term, :children def initialize(term, distancer) @term = term @children = {} @distancer = distancer end def distance(term) @distancer.distance(term, self.term) end end end
def add(term) score = distance(term) if child = children[score] child.add(term) else children[score] = Node.new(term, @distancer) end end
def query(term, threshold, collected) d = distance(term) collected << self.term if d <= threshold ((threshold-d)..(threshold+d)).each do |score| child = children[score] child.query(term, threshold, collected) if child end end
module BK class Tree def initialize(distancer = LevenshteinDistancer.new) @root = nil @distancer = distancer end def add(term) if @root @root.add(term) else @root = Node.new(term, @distancer) end end def query(term, threshold) collected = [] @root.query(term, threshold, collected) return collected end end end
Real-world performance
Looking for words within distance 1 of ‘alien’ in a 20,000-word dictionary
Loading 20000 words from dictionary ... 0.273s Building tree ... 57.331s Linear scan to find expected terms ... 5.711s Query tree ... 0.133s 2.1% of tree was queried
Not too bad!
Not too bad!
About 40x as fast
Not too bad!
About 40x as fast
... but it took 10x as long to build the tree as to perform a linear search
With threshold of 2
Query tree ... 1.461s 31.1% of tree was queried
With threshold of 3
Query tree ... 3.368s 62.9% of tree was queried
Limitations
Memory usage
Memory usage
About 6 MB for a 20,000-word tree
Maximum tree depth limited by stack
Improvements
Improvements
Improvements
Improvements