I’ve been doing some work with Treetop lately. It’s a parsing expression grammar library for Ruby that claims:

Parsing expression grammars (PEGs) are simple to write and easy to maintain. They are a simple but powerful generalization of regular expressions that are easier to work with than the LALR or LR-1 grammars of traditional parser generators. There’s no need for a tokenization phase, and lookahead assertions can be used for a limited degree of context-sensitivity.

And that’s true. PEGs are, for the most part, easy to write and maintain. A couple of excellent talks at LRUG a few months ago demonstrated just how easy and useful they can be.

However, whilst PEGs may be simple in concept, I think many people have found it hard just to get started with Treetop. The documentation is sparse, and doesn’t present a good introduction for the first-time user. It certainly took me quite a while to get to the black triangle stage. So, while the struggle is fresh in my mind, I’ve decided to try to help out those who follow.

I’ve put together a project on GitHub with a couple of examples of Treetop grammars that parse HTML files. The first is very simple, and simply splits the file up into text or tags. The second, more complex, example processes opening, closing, and empty tags separately and extracts attributes.

I’ll walk you through the process of building up the simple grammar and using it to parse a file.

First, install the treetop and polyglot libraries (you can do this via gem install). polyglot allows you to load Treetop grammars directly in Ruby via require, which makes development a bit easier.

We’re going to write a parser that simply splits an HTML document into tags and text. Here’s a grammar file I’m going to call simple_html.treetop:

grammar SimpleHTML
  rule document
    (text / tag)*
  end

  rule text
    [^<]+
  end

  rule tag
    "<" [^>]+ ">"
  end
end

All this says is that a document consists of zero or more tags and text. Text is anything without an angle bracket. A tag starts with an angle bracket, contains one or more characters that aren’t a closing angle bracket, and finishes with a closing angle bracket. It’s important that we put the parent rule (document) at the top so that it is matched in preference to the child rules that come later.

This grammar doesn’t let us get at the content easily, but it does tell us whether a document conforms to our grammar:

require "treetop"
require "polyglot"
require "simple_html"

parser = SimpleHTMLParser.new

p parser.parse("")
# => SyntaxNode offset=0, ""

p parser.parse("<<<<<")
# => nil

p parser.parse("<p>Paragraph</p>")
# => SyntaxNode offset=0, "<p>Paragraph</p>":
#      SyntaxNode+Tag0 offset=0, "<p>":
#        SyntaxNode offset=0, "<"
#        SyntaxNode offset=1, "p":
#          SyntaxNode offset=1, "p"
#        SyntaxNode offset=2, ">"
#        ...

The first important point is that Treetop generates a Parser class whose name is that of the grammar followed by Parser. In this case, grammar SimpleHTML has given us class SimpleHTMLParser.

The second point is that parse returns nil if it can’t parse the document.

Now, whilst it’s obvious from the SyntaxNode output that something is happening, it’s not immediately useful. But these nodes are Ruby objects, and Treetop lets us define our own methods on them. The simplest way to do this is to put some Ruby methods inline in the grammar file, surrounded by braces:

grammar SimpleHTML
  rule document
    (text / tag)* {
      def content
        elements.map{ |e| e.content }
      end
    }
  end

  rule text
    [^<]+ {
      def content
        [:text, text_value]
      end
    }
  end

  rule tag
    "<" [^>]+ ">" {
      def content
        [:tag, text_value]
      end
    }
  end
end

The text_value method gives us access to the raw content of a node. elements lets us dig down a level into the child nodes. If we now call the content method on our parsed document, we get a list of text and tags, thus:

require "treetop"
require "polyglot"
require "simple_html"

parser = SimpleHTMLParser.new

p parser.parse("<p>Paragraph</p>").content
# => [[:tag, "<p>"], [:text, "Paragraph"], [:tag, "</p>"]]

We’re on our way now. By expanding the rules, we can get down into tag types and attributes, as in this more complex grammar.

Some other points are worth noting:

  • You don’t have to use polyglot. You can use the command-line tt program to convert a Treetop grammar into Ruby. This gives better error reporting, and may save time if you have a particularly complex grammar, as it only needs to be processed once.
  • You can subclass SyntaxNode and define methods in a separate Ruby file. This gives better backtraces.
  • Treetop is greedy in its consumption of input. This makes some things that would be easy in regular expressions quite difficult in Treetop.

Have a look at my Treetop examples if you want to get started or just play around. I hope I’ve managed to save someone a bit of time.