Getting started with Treetop
I’ve been doing some work with Treetop lately. It’s a parsing expression grammar library for Ruby that claims:
Parsing expression grammars (PEGs) are simple to write and easy to maintain. They are a simple but powerful generalization of regular expressions that are easier to work with than the LALR or LR-1 grammars of traditional parser generators. There’s no need for a tokenization phase, and lookahead assertions can be used for a limited degree of context-sensitivity.
And that’s true. PEGs are, for the most part, easy to write and maintain. A couple of excellent talks at LRUG a few months ago demonstrated just how easy and useful they can be.
However, whilst PEGs may be simple in concept, I think many people have found it hard just to get started with Treetop. The documentation is sparse, and doesn’t present a good introduction for the first-time user. It certainly took me quite a while to get to the black triangle stage. So, while the struggle is fresh in my mind, I’ve decided to try to help out those who follow.
I’ve put together a project on GitHub with a couple of examples of Treetop grammars that parse HTML files. The first is very simple, and simply splits the file up into text or tags. The second, more complex, example processes opening, closing, and empty tags separately and extracts attributes.
I’ll walk you through the process of building up the simple grammar and using it to parse a file.
First, install the treetop
and
polyglot
libraries (you can do this via gem
install
). polyglot
allows you to load Treetop
grammars directly in Ruby via require
, which makes
development a bit easier.
We’re going to write a parser that simply splits an HTML
document into tags and text. Here’s a grammar file I’m going to
call simple_html.treetop
:
grammar SimpleHTML rule document (text / tag)* end rule text [^<]+ end rule tag "<" [^>]+ ">" end end
All this says is that a document consists of zero or more
tags and text. Text is anything without an angle
bracket. A tag starts with an angle bracket, contains one or more
characters that aren’t a closing angle bracket, and finishes with a
closing angle bracket. It’s important that we put the parent rule
(document
) at the top so that it is matched in
preference to the child rules that come later.
This grammar doesn’t let us get at the content easily, but it does tell us whether a document conforms to our grammar:
require "treetop" require "polyglot" require "simple_html" parser = SimpleHTMLParser.new p parser.parse("") # => SyntaxNode offset=0, "" p parser.parse("<<<<<") # => nil p parser.parse("<p>Paragraph</p>") # => SyntaxNode offset=0, "<p>Paragraph</p>": # SyntaxNode+Tag0 offset=0, "<p>": # SyntaxNode offset=0, "<" # SyntaxNode offset=1, "p": # SyntaxNode offset=1, "p" # SyntaxNode offset=2, ">" # ...
The first important point is that Treetop generates a Parser
class whose name is that of the grammar followed by
Parser
. In this case, grammar SimpleHTML
has given us class SimpleHTMLParser
.
The second point is that parse returns nil
if it
can’t parse the document.
Now, whilst it’s obvious from the SyntaxNode
output
that something is happening, it’s not immediately useful. But these
nodes are Ruby objects, and Treetop lets us define our own methods
on them. The simplest way to do this is to put some Ruby methods
inline in the grammar file, surrounded by braces:
grammar SimpleHTML rule document (text / tag)* { def content elements.map{ |e| e.content } end } end rule text [^<]+ { def content [:text, text_value] end } end rule tag "<" [^>]+ ">" { def content [:tag, text_value] end } end end
The text_value
method gives us access to the raw
content of a node. elements
lets us dig down a level
into the child nodes. If we now call the content
method on our parsed document, we get a list of text and tags,
thus:
require "treetop" require "polyglot" require "simple_html" parser = SimpleHTMLParser.new p parser.parse("<p>Paragraph</p>").content # => [[:tag, "<p>"], [:text, "Paragraph"], [:tag, "</p>"]]
We’re on our way now. By expanding the rules, we can get down into tag types and attributes, as in this more complex grammar.
Some other points are worth noting:
- You don’t have to use
polyglot
. You can use the command-linett
program to convert a Treetop grammar into Ruby. This gives better error reporting, and may save time if you have a particularly complex grammar, as it only needs to be processed once. - You can subclass SyntaxNode and define methods in a separate Ruby file. This gives better backtraces.
- Treetop is greedy in its consumption of input. This makes some things that would be easy in regular expressions quite difficult in Treetop.
Have a look at my Treetop examples if you want to get started or just play around. I hope I’ve managed to save someone a bit of time.