At work, I’ve spent a reasonable chunk of the past two weeks implementing an hReview parser. To briefly explain, hReview, along with other microformats, is a way to mark up a standard web page invisibly in such a way that the contents can be read and aggregated by a computer.

It all began with a task to improve our current parser so that it could accommodate multiple photos for a single review. However, since one of the ‘business value’ points on the task card was to develop a reference parser, I decided to go back a step and develop a generalised parser that could handle some of the other microformats.

Once I had something that could process the basic nested structure that many of the microformats adhere to, I could set out the structure and parse about 90% of the hReview and (because hReview relies on it) hCard specifications. Unfortunately, that left a hard core of fields that needed special consideration due to weird or lax parts of the specification. (E.g. the telephone number goes inside a ‘value’ element. Or not. It depends.) Still, with a set of test documents, I worked my way through the edge cases until I could extract 100% of the information from all the valid documents I could find.
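To make the ‘value’ edge case concrete, here is a minimal sketch in Python of how a parser might cope with both forms. It is not the code from our parser: the helper names (`has_class`, `find_properties`, `property_text`), the sample phone number, and the choice to join multiple ‘value’ elements together are all my own assumptions for illustration, and it only works on well-formed XHTML fragments.

```python
import xml.etree.ElementTree as ET


def has_class(el, name):
    """True if `name` is one of the space-separated classes on the element."""
    return name in el.get("class", "").split()


def find_properties(root, name):
    """All elements under (and including) `root` carrying the given class."""
    return [el for el in root.iter() if has_class(el, name)]


def property_text(el):
    """Text of a property: nested 'value' elements win, else the whole element."""
    values = [d for d in el.iter() if d is not el and has_class(d, "value")]
    if values:
        # Assumption: several 'value' elements are simply concatenated.
        return "".join("".join(v.itertext()).strip() for v in values)
    return "".join(el.itertext()).strip()


# Hypothetical sample markup: the same number, with and without a 'value' element.
WITH_VALUE = ET.fromstring(
    '<div class="vcard"><span class="tel">'
    '<span class="type">home</span>: <span class="value">+1 555 0100</span>'
    "</span></div>"
)
WITHOUT_VALUE = ET.fromstring(
    '<div class="vcard"><span class="tel">+1 555 0100</span></div>'
)

tel_a = property_text(find_properties(WITH_VALUE, "tel")[0])
tel_b = property_text(find_properties(WITHOUT_VALUE, "tel")[0])
```

Both `tel_a` and `tel_b` come out as the bare number, which is the point: the caller shouldn’t have to care which variant the page author chose.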

hCard was particularly horrible, due in no small part to the unwillingness of the specification’s authors to actually, well, specify it.

Is there a list of all hCard properties which can be plural?

We have avoided duplicating (or providing a shortcut for) the “can this property occur multiple times or not” deliberately in order to avoid repeating a constraint from RFC 2426 vCard, and thus potentially getting it wrong. Here is the way to determine whether or not a particular property can occur multiple times (is a plural property / may have multiple instances or values).

  1. Check the hCard XMDP profile for the property definition.
  2. If the property definition references a plural form in RFC 2426 (e.g. honorific-suffix references honorific suffixes), then the property is a plural property.
  3. Else go check the referenced section in RFC 2426 which should state explicitly whether or not the property is plural or singular.
  4. Else (if RFC 2426 is not explicit) then the property is plural.
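For what it’s worth, the four steps above boil down to something like the following Python sketch. The function name `is_plural` and both lookup tables are hypothetical: somebody still has to fill them in by reading the XMDP profile and RFC 2426, which is exactly the work the specification declines to do. The only entry taken from the quoted text is honorific-suffix; `example-singular` and `example-unstated` are made-up placeholders, not real hCard properties.

```python
def is_plural(prop, profile_uses_plural_form, rfc_cardinality):
    """Apply the FAQ's decision procedure to a single property name.

    profile_uses_plural_form: property -> True if the profile definition
        references a plural form in RFC 2426 (steps 1 and 2).
    rfc_cardinality: property -> "plural" or "singular" where RFC 2426
        says so explicitly; properties it is silent about are absent (step 3).
    """
    # Steps 1 and 2: the profile definition references a plural form.
    if profile_uses_plural_form.get(prop, False):
        return True
    # Step 3: RFC 2426 states the cardinality explicitly.
    explicit = rfc_cardinality.get(prop)
    if explicit is not None:
        return explicit == "plural"
    # Step 4: RFC 2426 is not explicit, so the property is plural.
    return True


# Illustrative tables only; the placeholder names are not from any specification.
PROFILE_USES_PLURAL_FORM = {"honorific-suffix": True}
RFC_CARDINALITY = {"example-singular": "singular"}

suffix_is_plural = is_plural(
    "honorific-suffix", PROFILE_USES_PLURAL_FORM, RFC_CARDINALITY
)
singular_is_plural = is_plural(
    "example-singular", PROFILE_USES_PLURAL_FORM, RFC_CARDINALITY
)
unstated_is_plural = is_plural(
    "example-unstated", PROFILE_USES_PLURAL_FORM, RFC_CARDINALITY
)
```

The logic is trivial; the tables are not. Two implementors who read the RFC differently will end up with different tables and therefore different parsers.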

What a cop-out! RFC 2426 is, by the way, horrible to read. Surely discussing the specification in public would be better than expecting every implementor to follow that convoluted process and—magically—to come up with interoperable software. That (miserable excuse for a) specification is a guaranteed path to incompatible implementations. Incompatible implementations lead to anger. Anger leads to hate. The path to the Dark Side, that is. Or something.

Parsing hCards is particularly difficult: most of the few seen in the wild are broken in some way, and some of their information can’t be extracted automatically. Still, despite these difficulties, I succeeded, and could parse every well-formed hReview and hCard perfectly.

And there our story would end. Except … a day later, they released the revised version 0.3 specification! This is admittedly a predictable problem when working with draft specifications. My parser could handle reviews written according to version 0.3, with one small omission: the actual rating of the review itself. From our point of view, this is one of the most important parts. Still, it was easy to fix.

However, the latest draft of the specification blithely throws in a couple of hand grenades: in addition to rel-tag and hCard, a comprehensive hReview parser now needs to understand hCalendar, rel-license, and include-pattern. Adding those is a bit more work.

Whilst I understand the crowd’s love for their own product, I do have to wonder whether they aren’t raising the barrier to implementation a bit too high by making all the microformats interdependent. Now, in order to handle one specification you’re interested in, you must first implement five that are not directly relevant to the current task. That’s not a microformat any longer.

I hope we’ll release the hReview parser for public use in the next few days (update: it’s here), by which time I should have added the latest changes. Partly, it’s good publicity, but I think there’s also a big advantage in being among the first to implement it: we get to say what’s valid and what isn’t!