Chasing moving targets

At work, I’ve spent a reasonable chunk of the past two weeks implementing an hReview parser. To explain briefly: hReview, like other microformats, is a way of marking up a standard web page, invisibly to the human reader, so that its contents can be read and aggregated by a computer.

It all began with a task to improve reevoo.com’s current parser so that it could accommodate multiple photos for a single review. However, since one of the ‘business value’ points on the task card was to develop a reference parser, I decided to go back a step and develop a generalised parser that could handle some of the other microformats.

Once I had something that could process the basic nested structure that many of the microformats adhere to, I could set out the structure and parse about 90% of the hReview and (because hReview relies on it) hCard specifications. Unfortunately, that left a hard core of fields that needed special consideration due to weird or lax parts of the specification. (For example, the telephone number goes inside a ‘value’ element. Or not. It depends.) Still, with a set of test documents, I worked my way through the edge cases until I could extract 100% of the information from all the valid documents I could find.
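To give a flavour of that telephone number edge case, here is a minimal sketch in Python. It is not the parser described in this post. The helper names (has_class, tel_value, find_tels) are made up for illustration, the input is assumed to be well-formed XHTML so that the standard library XML parser can read it, and joining several ‘value’ elements together is my assumption about the sensible thing to do.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    """True if the element's class attribute contains the given class name."""
    return name in element.get("class", "").split()


def tel_value(tel):
    """Extract the phone number from an element carrying class 'tel'."""
    # Case 1: the number sits inside one or more nested 'value' elements.
    values = [el for el in tel.iter() if el is not tel and has_class(el, "value")]
    if values:
        # Assumption: several 'value' elements are joined together.
        return "".join("".join(v.itertext()).strip() for v in values)
    # Case 2: no 'value' element, so the whole text content is the number.
    return "".join(tel.itertext()).strip()


def find_tels(markup):
    """Return every telephone number found in a well-formed XHTML fragment."""
    root = ET.fromstring(markup)
    return [tel_value(el) for el in root.iter() if has_class(el, "tel")]
```

Fed either `<span class="tel">…</span>` containing the bare number, or the same span wrapping a ‘type’ and a ‘value’ child, find_tels should hand back the same number. Real pages are rarely well-formed, so a real parser would need something more forgiving than ElementTree.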

hCard was particularly horrible, due in no small part to the unwillingness of the specification’s authors to actually, well, specify it.

Is there a list of all hCard properties which can be plural?

We have avoided duplicating (or providing a shortcut for) the “can this property occur multiple times or not” deliberately in order to avoid repeating a constraint from RFC 2426 vCard, and thus potentially getting it wrong. Here is the way to determine whether or not a particular property can occur multiple times (is a plural property / may have multiple instances or values).

  1. Check the hCard XMDP profile for the property definition.
  2. If the property definition references a plural form in RFC 2426 (e.g. honorific-suffix references honorific suffixes), then the property is a plural property.
  3. Else go check the referenced section in RFC 2426 which should state explicitly whether or not the property is plural or singular.
  4. Else (if RFC 2426 is not explicit) then the property is plural.
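Written out as code, the four steps amount to something like the following sketch. The PropertyNotes record and the is_plural function are my own invention, not anything from the hCard documentation, and the two inputs still have to be filled in by hand by reading the XMDP profile and RFC 2426, which is exactly the tedious part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyNotes:
    """Hand-gathered facts about one hCard property (hypothetical record)."""
    # Steps 1 and 2: does the profile definition reference a plural form?
    profile_references_plural: bool
    # Step 3: what RFC 2426 says explicitly, or None if it says nothing.
    rfc_explicit_plural: Optional[bool] = None


def is_plural(notes: PropertyNotes) -> bool:
    """Apply the FAQ's decision procedure to one property."""
    if notes.profile_references_plural:
        return True                       # step 2
    if notes.rfc_explicit_plural is not None:
        return notes.rfc_explicit_plural  # step 3
    return True                           # step 4: plural by default


# The one example the FAQ itself gives: honorific-suffix references
# "honorific suffixes", so it is plural.
honorific_suffix = PropertyNotes(profile_references_plural=True)
```

The logic is three lines. Everything that matters lives in the data, and every implementor has to collect that data independently.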

What a cop-out! RFC 2426 is, by the way, horrible to read. Surely discussing the specification in public would be better than expecting every implementor to follow that convoluted process and—magically—to come up with interoperable software. That (miserable excuse for a) specification is a guaranteed path to incompatible implementations. Incompatible implementations lead to anger. Anger leads to hate. The path to the Dark Side, that is. Or something.

Parsing hCards is particularly difficult: most of the few seen in the wild are broken in some way, and some of their information can’t be extracted automatically. Still, despite these difficulties, I succeeded, and could parse every well-formed hReview and hCard perfectly.

And there our story would end. Except … a day later, they released the revised version 0.3 specification! This is admittedly a predictable problem when working with draft specifications. My parser could handle reviews written according to version 0.3, with one small omission: the actual rating of the review itself. From our point of view, this is one of the most important parts. Still, it was easy to fix.

However, the latest draft of the specification blithely throws in a couple of hand grenades: in addition to rel-tag and hCard, a comprehensive hReview parser now needs to understand hCalendar, rel-license, and include-pattern. Adding those is a bit more work.

Whilst I understand the microformats.org crowd’s love for their own product, I do have to wonder whether they aren’t raising the barrier to implementation a bit too high by making all the microformats interdependent. Now, in order to handle one specification you’re interested in, you must first implement five that are not directly relevant to the current task. That’s not a microformat any longer.

We’re going to release the hReview parser for public use in the next few days (update: it’s here), I hope, by which time I should have added the latest changes. Partly, it’s good publicity, but I think there’s also a big advantage in being among the first to implement it: we get to say what’s valid and what isn’t!

Comments

  1. Tantek Çelik

    Wrote at 2006-03-03 02:10 UTC using Firefox 1.0.7 on Mac OS X:

    Hi Paul,

    First of all, this is really good feedback and I very much appreciate it. Thanks very much for taking the time to blog about it.

    Very happy to hear that you are working on an hReview parser.

    Your points about RFC2426 are very well taken. As one who has read that document several times and had to interpret various portions, I still find it challenging, and I am starting to lean more towards your perspective that perhaps clarifying an existing specification may be quite helpful.

    Regarding the hCards that you found “in the wild … broken in some way”, definitely note that next to the link to the example in the wild hCard in the hCard specification, with hopefully a few words about how you thought it was broken.

    The more we know about precisely how people are getting things wrong, the more we can improve the FAQs, and perhaps even some of the parsers/converters (such as X2V) to catch common errors and notify the author accordingly.

    And regarding hReview 0.3. First, I invite you, as a microformat developer to join the microformats-dev mailing list:

    http://microformats.org/discuss

    Second, the “rating” property was actually in hReview from the very first version (v0.1) so it is not something new. Third, if your parser supported hReview 0.2, and you found that to be fine, that’s not a problem. As you observed, hReview 0.3 is fairly new, and we are still getting publishing experience with it and tweaking little details here and there. If you do decide to go ahead and implement the new pieces in hReview 0.3, I’d really like to see your feedback on what challenges you encountered, on the microformats-dev list.

    I have to take a slight exception to your use of “interdependent”, because that’s not entirely correct since hReview depends on hCard (and hCalendar) but not vice versa. That’s just dependent. And the dependence is deliberate to actually reuse those specifications as building blocks and thus minimize the number of “new” things defined by hReview. One of the essential pieces of a microformat is that it tries to build upon existing microformats as much as possible, and introduce as few new terms as possible. That’s one of the aspects that makes it “micro” – minimizing the amount of invention and new vocabulary.

    Finally, you’re absolutely right about the advantages of implementing microformats parsers sooner than later. In order to make sure that we’re not miscommunicating anything in the various specs you have implemented, I definitely encourage you to join the microformats-discuss mailing list and encourage feedback on your implementation.

    Once again, thanks for both your time and your hard work, and I look forward to hearing more details of your experience, hopefully on the mailing list.

    Thanks,

    Tantek